This document describes a project submitted for a degree in computer science. It studies techniques for visualizing the association rules that mining algorithms discover in databases. The project aims to identify the strengths and weaknesses of these visualization techniques in order to determine which is most appropriate for addressing a main drawback of association rules: the huge number of extracted rules, which cannot be inspected manually. The document provides background on data mining, association rules, and functional dependencies. It then outlines chapters that explain the knowledge discovery process, association rule mining, and the techniques used for association rule visualization.
The huge volume of text documents available on the internet has made it difficult for specific users to
find valuable information, so efficient applications for extracting knowledge of interest from textual
documents are vitally important. This paper addresses the problem of responding to user queries by
fetching the most relevant documents from a clustered set of documents. For this purpose, a
cluster-based information retrieval framework is proposed in order to design and develop a system for
analysing and extracting useful patterns from text documents. In this approach, a pre-processing step
is first performed to find frequent and high-utility patterns in the data set. A Vector Space Model
(VSM) is then used to represent the dataset. The system was implemented in two main phases. In phase 1,
a clustering analysis process groups documents into several clusters, while in phase 2, an information
retrieval process ranks clusters according to the user query in order to retrieve the relevant
documents from the clusters deemed relevant to the query. The results are then evaluated using recall
and precision (P@5, P@10) of the retrieved results. P@5 was 0.660 and P@10 was 0.655.
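The P@5 and P@10 figures reported above are precision-at-k scores. As a minimal sketch of how such a score could be computed for one query, assuming a ranked list of document identifiers and a set of documents judged relevant (the identifiers and judgements below are hypothetical and not taken from the paper):

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Return the fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    # Dividing by k (not by len(top_k)) penalises rankings shorter than k.
    return hits / k


# Hypothetical ranking returned for a single query, best match first.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
# Hypothetical relevance judgements for that query.
relevant = {"d3", "d1", "d4", "d2", "d5", "d6"}

p5 = precision_at_k(ranked, relevant, 5)
p10 = precision_at_k(ranked, relevant, 10)
```

A system-level score such as the ones quoted in the abstract would normally be the average of these per-query values over all test queries.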
The Survey of Data Mining Applications And Feature Scope IJCSEIT Journal
In this paper we focus on a variety of techniques, approaches and areas of research that are helpful
and regarded as important in the field of data mining technologies. Many MNCs and large organizations
operate in different places in different countries, and each place of operation may generate large
volumes of data. Corporate decision makers require access to all such sources in order to take
strategic decisions. The data warehouse delivers significant business value by improving the
effectiveness of managerial decision-making. In an uncertain and highly competitive business
environment, the value of strategic information systems such as these is easily recognized; however, in
today's business environment, efficiency or speed is not the only key to competitiveness. Huge amounts
of data, on the scale of terabytes to petabytes, are now available and have drastically changed the
areas of science and engineering. To analyze, manage and make decisions from such huge amounts of data
we need the techniques known as data mining, which is transforming many fields. This paper presents a
number of applications of data mining and also focuses on the scope of data mining, which will be
helpful in further research.
A Survey of Agent Based Pre-Processing and Knowledge Retrieval IOSR Journals
Abstract: Information retrieval is a major task in the present scenario, as the quantum of data is increasing
at tremendous speed. Managing and mining knowledge for different users according to their interests is the goal
of every organization, whether it is related to grid computing, business intelligence, distributed databases or
any other area. In achieving this goal of extracting quality information from large databases, software agents
have proved to be a strong pillar. Over the decades, researchers have applied the concept of multi agents to the
process of data mining by focusing on its various steps. Among these, data pre-processing is found to be the most
sensitive and crucial step, as the quality of the knowledge retrieved depends entirely on the quality of the raw
data. Many methods and tools are available to pre-process data in an automated fashion using intelligent
(self-learning) mobile agents in distributed as well as centralized databases, but various quality factors still
need attention to improve the quality of the retrieved knowledge. This article provides a review of the
integration of these two emerging fields, software agents and the knowledge retrieval process, with a focus
on the data pre-processing step.
Keywords: Data Mining, Multi Agents, Mobile Agents, Preprocessing, Software Agents
Cluster Based Access Privilege Management Scheme for Databases Editor IJMTER
Knowledge discovery is carried out using data mining techniques, which include association rule mining,
classification and clustering. Clustering groups records by relevance, with distance or similarity measures
used to estimate the relationship between transactions. Census data and medical data are referred to as
microdata. Data publishing schemes provide private data for analysis, privacy preservation protects private
data values, and anonymity is considered in the privacy preservation process.
Access control models restrict data values to authorized users. A Privacy Protection Mechanism (PPM) uses
suppression and generalization of relational data to anonymize it and satisfy privacy needs. An
accuracy-constrained privacy-preserving access control framework is used to manage access control in a
relational database. The access control policies define the selection predicates available to roles, while the
privacy requirement is to satisfy k-anonymity or l-diversity. An imprecision bound constraint is assigned to
each selection predicate. k-anonymous Partitioning with Imprecision Bounds (k-PIB) is used to estimate
accuracy and privacy constraints. Role-based Access Control (RBAC) allows permissions on objects to be
defined based on roles in an organization. The Top Down Selection Mondrian (TDSM) algorithm is used for query
workload-based anonymization and is constructed using greedy heuristics and a kd-tree model. Query cuts are
selected with minimum bounds in the Top-Down Heuristic 1 algorithm (TDH1). The query bounds are updated as
partitions are added to the output in the Top-Down Heuristic 2 algorithm (TDH2). The cost of reduced precision
in the query results is used in the Top-Down Heuristic 3 algorithm (TDH3). A repartitioning algorithm is used
to reduce the total imprecision for the queries.
The privacy-preserving access privilege management scheme is enhanced to provide incremental mining
features. Data insert, delete and update operations are connected to the partition management mechanism.
Cell-level access control is provided using a differential privacy method. A dynamic role management model is
integrated with the access control policy mechanism for query predicates.
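The k-anonymity requirement mentioned above can be illustrated with a small check. The following is only a sketch of the general definition (every combination of quasi-identifier values must occur in at least k records) together with a simple age generalization; it is not the TDSM, TDH1-3 or k-PIB procedure from the abstract, and the records, column names and bucket width are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a range such as '20-29' (a simple generalization)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical microdata; 'age' and 'zip' are treated as quasi-identifiers.
records = [
    {"age": 23, "zip": "476**", "disease": "flu"},
    {"age": 27, "zip": "476**", "disease": "cold"},
    {"age": 34, "zip": "479**", "disease": "flu"},
    {"age": 38, "zip": "479**", "disease": "asthma"},
]

raw_ok = is_k_anonymous(records, ["age", "zip"], 2)
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, ["age", "zip"], 2)
```

Here the raw table fails the check because each exact age is unique, while the generalized table passes with k = 2. The coarser the generalization, the less precise query answers become, which is the trade-off the imprecision bounds in the abstract are meant to control.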
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma... IJECEIAES
Leakage and misuse of sensitive data is a challenging problem for enterprises. It has become a more serious problem with the advent of cloud and big data, because of the increase in outsourcing of data to public clouds and in publishing data for wider visibility. Therefore Privacy Preserving Data Publishing (PPDP), Privacy Preserving Data Mining (PPDM) and Privacy Preserving Distributed Data Mining (PPDDM) are crucial in the contemporary era. PPDP and PPDM can protect privacy at the data and process levels respectively. With big data, privacy of data has become indispensable because data is stored and processed in a semi-trusted environment. In this paper we propose a comprehensive methodology for effective sanitization of data, based on a misusability measure, that preserves privacy and prevents data leakage and misuse. We follow a hybrid approach that caters to the needs of privacy-preserving MapReduce programming. We propose an algorithm known as the Misusability Measure-Based Privacy Preserving Algorithm (MMPP), which considers the level of misusability before choosing and applying an appropriate sanitization to big data. Our empirical study with Amazon EC2 and EMR revealed that the proposed methodology is useful in realizing privacy-preserving MapReduce programming.
A new hybrid algorithm for business intelligence recommender system IJNSA Journal
Business Intelligence is a set of methods, processes and technologies that transform raw data into
meaningful and useful information. A recommender system is a business intelligence system that is used
to provide knowledge to the active user for better decision making. Recommender systems apply data
mining techniques to the problem of making personalized recommendations for information. The growth in
the amount of information and the number of users in recent years poses challenges for recommender
systems. Collaborative, content-based, demographic and knowledge-based are four different types of
recommender systems. In this paper, a new hybrid algorithm is proposed for recommender systems which
combines knowledge-based recommendation, user profiles and a most-frequent-item mining technique to
obtain intelligence.
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA IJMIT JOURNAL
Classification is a data mining function that assigns items in a collection to target categories or
classes in order to provide more accurate predictions and analysis. Classification using a supervised
learning method aims to identify the class to which a new data item belongs. With the advancement of
technology and the increase in real-time data generated by sources such as the Internet, IoT and
social media, processing has become more demanding and challenging. One such challenge in processing
is data imbalance. In an imbalanced dataset, majority classes dominate minority classes, causing
machine learning classifiers to be biased towards the majority classes, and most classification
algorithms then predict the majority class for all test data. In this paper, the authors analyse
data imbalance models using big data and classification algorithms.
Processing the data generated by everyday transactions, which amounts to thousands of records per day, requires software that lets users search for the data they need. Data mining is a solution to this problem. To that end, many large companies began creating software that can perform such data processing. Because data mining software from large vendors is expensive, some communities, such as universities, eventually developed open source software for users who simply want to learn or deepen their knowledge of data mining. Meanwhile, many commercial vendors market their own products. WEKA and Salford System are both data mining software packages, and each has advantages and disadvantages. This study compares them using several attributes, so that users can select the software that is more suitable for their daily activities.
Using Randomized Response Techniques for Privacy-Preserving Data Mining 14894
Privacy is an important issue in data mining and knowledge
discovery. In this paper, we propose to use the randomized
response techniques to conduct the data mining computation.
Specifically, we present a method to build decision tree
classifiers from the disguised data. We conduct experiments
to compare the accuracy of our decision tree with the one
built from the original undisguised data. Our results show
that although the data are disguised, our method can still
achieve fairly high accuracy. We also show how the parameter
used in the randomized response techniques affects the
accuracy of the results.
Keywords
Privacy, security, decision tree, data mining
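To make the idea of mining from disguised data concrete, here is a minimal sketch of one classic randomized response scheme (Warner's model), in which each respondent reports the true binary value with probability theta and its complement otherwise. Treating this as the variant used in the paper is an assumption; the paper's decision tree construction is not reproduced, and the names `disguise`, `estimate_proportion` and the sample data are hypothetical.

```python
import random


def disguise(true_values, theta, rng):
    """Report each true bit with probability theta, otherwise report its complement."""
    return [v if rng.random() < theta else 1 - v for v in true_values]


def estimate_proportion(disguised, theta):
    """Estimate the true fraction of 1s from disguised answers.

    Observed P(1) = theta * pi + (1 - theta) * (1 - pi), solved here for pi.
    """
    if theta == 0.5:
        raise ValueError("theta = 0.5 makes the answers carry no information")
    observed = sum(disguised) / len(disguised)
    return (observed + theta - 1) / (2 * theta - 1)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_values = [1] * 30000 + [0] * 70000  # hypothetical attribute, true proportion 0.3
disguised = disguise(true_values, theta=0.8, rng=rng)
estimate = estimate_proportion(disguised, theta=0.8)
```

The estimate stays close to the true proportion even though no individual answer can be trusted. As theta moves towards 0.5 the divisor shrinks and the estimate becomes noisier, which is one way the parameter affects accuracy.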
A Web Extraction Using Soft Algorithm for Trinity Structureiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining idescitation
Nowadays, data sharing between two organizations
is common in many application areas such as business planning
or marketing. When data are to be shared between parties,
there could be some sensitive data which should not be
disclosed to the other parties. Medical records are especially
sensitive, so privacy protection is taken more seriously. As
required by the Health Insurance Portability and
Accountability Act (HIPAA), it is necessary to protect the
privacy of patients and ensure the security of the medical
data. To address this problem, released datasets must
unavoidably be modified. We propose and implement a method
called the Hybrid approach for privacy preserving. First we
randomize the original data. Then we apply generalization
to the randomized or modified data. This technique protects
private data with better accuracy; it can also reconstruct
the original data and provide data with no information loss,
which preserves the usability of the data.
Data Transformation Technique for Protecting Private Information in Privacy P... acijjournal
Data mining is the process of extracting patterns from data. It is seen as an increasingly important tool by modern business for transforming data into an informational advantage. Data mining can be used in any organization that needs to find patterns or relationships in its data; it is a group of techniques that find relationships that have not previously been discovered. In many situations the extracted patterns are highly private and should not be disclosed. To maintain the secrecy of data, techniques and algorithms are needed that modify the original data in order to limit the extraction of confidential patterns. There are two types of privacy in data mining. In the first, the data is altered so that the mining result will preserve certain privacy. In the second, the data is manipulated so that the mining result is not affected or is minimally affected. The aim of privacy-preserving data mining researchers is to develop data mining techniques that can be applied to databases without violating the privacy of individuals. Many techniques for privacy-preserving data mining have emerged over the last decade, including statistical, cryptographic and randomization methods, the k-anonymity model and l-diversity. In this work, we propose a new perturbative masking technique, known as the data transformation technique, that can be used for protecting sensitive information. Experimental results show that the proposed technique gives better results than the existing technique.
Big data is a prominent term which characterizes the growth and availability of data in all three
formats: structured, unstructured and semi-structured. Structured data is located in a fixed field of
a record or file and is found in relational databases and spreadsheets, whereas unstructured data
includes text and multimedia content. The primary objective of the big data concept is to describe
extremely large data sets, both structured and unstructured. It is further defined by three "V"
dimensions, namely Volume, Velocity and Variety, to which two more have been added: Value and Veracity.
Volume denotes the size of the data, Velocity the speed of data processing, Variety the types of data,
Value the business value derived from the data, and Veracity the quality and understandability of the
data. Big data has become a distinctive and preferred research area in the field of computer science.
Many research problems in big data remain open. Researchers have proposed good solutions, but many new
techniques and algorithms for big data analysis still need to be developed in order to reach optimal
solutions. In this paper, a detailed study of big data is presented, covering its basic concepts,
history, applications, techniques, research issues and tools.
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...Anastasija Nikiforova
This presentation is devoted to the "ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detection tool or what Internet of Things Search Engines know about you" research paper developed by Artjoms Daskevics and Anastasija Nikiforova and presented during the International Conference on Intelligent Data Science Technologies and Applications (IDSTA2021), November 15-16, 2021, Tartu, Estonia (web-based).
Read paper here -> Daskevics, A., & Nikiforova, A. (2021, November). ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detection tool or what Internet of Things Search Engines know about you. In 2021 Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA) (pp. 38-45). IEEE.
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn... cscpconf
In today's world, a gigantic amount of data is available in science, industry, business and many
other areas. This data can provide valuable information which management can use for making
important decisions. The problem is how to find that valuable information, and the answer is data
mining. Data mining is a popular topic among researchers, and much of it remains unexplored. This
paper focuses on a fundamental concept of data mining, namely classification techniques. In this paper, the BayesNet, NaiveBayes, NaiveBayes Updateable, Multilayer Perceptron, Voted Perceptron and J48 classifiers are used to classify a data set. The performance of these classifiers is analyzed in terms of Mean Absolute Error, Root Mean-Squared Error and time taken to build the model, and the results are shown both statistically and graphically. The WEKA data mining tool is used for this purpose.
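The two error measures named in this abstract have simple definitions. The sketch below computes them for plain numeric vectors; the sample values are hypothetical, and how WEKA derives the inputs for a given classifier is not shown here.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalises large errors more."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_sq)


# Hypothetical actual values and model outputs.
actual = [1.0, 0.0, 1.0, 1.0]
predicted = [0.5, 0.5, 1.0, 0.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

Because the single large error (1.0 versus 0.0) is squared before averaging, the RMSE comes out higher than the MAE on this sample, which is why the two measures are usually reported together.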
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da... theijes
Data mining extracts information that is not known in advance from enormous quantities of data, which can lead to knowledge and helps in making good decisions. Its effectiveness lies in discovering the hidden facts contained in databases through the use of multiple technologies. Clustering organizes data into clusters or groups such that they have high intra-cluster similarity and low inter-cluster similarity. This paper deals with the K-means clustering algorithm, which groups data items based on their characteristics and attributes and performs the clustering by reducing the distances between the data items and the cluster centers. The algorithm is applied using the open source tool WEKA, with an insurance dataset as its input.
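The clustering step described above can be sketched in a few lines. This is a plain-Python illustration of the standard K-means iteration (assign each point to its nearest center, then move each center to the mean of its points), not WEKA's implementation; seeding the centers with the first k points and the sample points themselves are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption for this sketch: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical two-attribute records forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
centroids, assignments = kmeans(points, k=2)
```

In practice the initial centers are usually chosen at random or with a seeding heuristic, and the algorithm is run several times, because the result depends on where the centers start.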
We have concentrated on a range of strategies, methodologies, and distinct fields of research in this article, all of which are useful and relevant in the field of data mining technologies. As we all know, numerous multinational corporations and major corporations operate in various parts of the world. Each location of business may create significant amounts of data. Corporate decision-makers need access to all of these data sources in order to make strategic decisions.
Processing of the data generated from transactions that occur every day which resulted in nearly thousands of data per day requires software capable of enabling users to conduct a search of the necessary data. Data mining becomes a solution for the problem. To that end, many large industries began creating software that can perform data processing. Due to the high cost to obtain data mining software that comes from the big industry, then eventually some communities such as universities eventually provide convenience for users who want just to learn or to deepen the data mining to create software based on open source. Meanwhile, many commercial vendors market their products respectively. WEKA and Salford System are both of data mining software. They have the advantages and the disadvantages. This study is to compare them by using several attributes. The users can select which software is more suitable for their daily activities.
Using Randomized Response Techniques for Privacy-Preserving Data Mining14894
Privacy is an important issue in data mining and knowledge
discovery. In this paper, we propose to use the randomized
response techniques to conduct the data mining computation.
Specially, we present a method to build decision tree
classifiers from the disguised data. We conduct experiments
to compare the accuracy ofou r decision tree with the one
built from the original undisguised data. Our results show
that although the data are disguised, our method can still
achieve fairly high accuracy. We also show how the parameter
used in the randomized response techniques affects the
accuracy ofth e results
Keywords
Privacy, security, decision tree, data mining
A Web Extraction Using Soft Algorithm for Trinity Structureiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Miningidescitation
Nowadays, data sharing between two organizations
is common in many application areas, such as business planning
or marketing. When data are to be shared between parties,
some sensitive data may need to be withheld from
the other parties. Medical records are especially
sensitive, so privacy protection is taken more seriously. As
required by the Health Insurance Portability and
Accountability Act (HIPAA), it is necessary to protect the
privacy of patients and ensure the security of the medical
data. To address this problem, released datasets must
unavoidably be modified. We propose and implement a method called the Hybrid
approach for privacy preserving. First, we
randomize the original data. Then we apply
generalization to the randomized or modified data. This
technique protects private data with better accuracy; it can also
reconstruct the original data and provide data with no information
loss, which preserves the usability of the data.
Data Transformation Technique for Protecting Private Information in Privacy P...acijjournal
Data mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. Data
Mining can be utilized in any organization that needs to find patterns or relationships in their data. A group of techniques that find relationships that have not previously been discovered. In many situations, the extracted patterns are highly private and it should not be disclosed. In order to maintain the secrecy of data,
there is in need of several techniques and algorithms for modifying the original data in order to limit the extraction of confidential patterns. There have been two types of privacy in data mining. The first type of privacy is that the data is altered so that the mining result will preserve certain privacy. The second type of privacy is that the data is manipulated so that the mining result is not affected or minimally affected. The aim of privacy preserving data mining researchers is to develop data mining techniques that could be
applied on data bases without violating the privacy of individuals. Many techniques for privacy preserving data mining have come up over the last decade. Some of them are statistical, cryptographic, randomization methods, k-anonymity model, l-diversity and etc. In this work, we propose a new perturbative masking technique known as data transformation technique can be used for protecting the sensitive information. An
experimental result shows that the proposed technique gives the better result compared with the existing technique.
Big data is a prominent term which characterizes the improvement and availability of data in all three
formats like structure, unstructured and semi formats. Structure data is located in a fixed field of a record
or file and it is present in the relational data bases and spreadsheets whereas an unstructured data file
includes text and multimedia contents. The primary objective of this big data concept is to describe the
extreme volume of data sets i.e. both structured and unstructured. It is further defined with three “V”
dimensions namely Volume, Velocity and Variety, and two more “V” also added i.e. Value and Veracity.
Volume denotes the size of data, Velocity depends upon the speed of the data processing, Variety is
described with the types of the data, Value which derives the business value and Veracity describes about
the quality of the data and data understandability. Nowadays, big data has become unique and preferred
research areas in the field of computer science. Many open research problems are available in big data
and good solutions also been proposed by the researchers even though there is a need for development of
many new techniques and algorithms for big data analysis in order to get optimal solutions. In this paper,
a detailed study about big data, its basic concepts, history, applications, technique, research issues and
tools are discussed.
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...Anastasija Nikiforova
This presentation is devoted to the "ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detection tool or what Internet of Things Search Engines know about you" research paper developed by Artjoms Daskevics and Anastasija Nikiforova and presented during the The International Conference on Intelligent Data Science Technologies and Applications (IDSTA2021), November 15-16, 2021. Tartu, Estonia (web-based).
Read paper here -> Daskevics, A., & Nikiforova, A. (2021, November). ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detection tool or what Internet of Things Search Engines know about you. In 2021 Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA) (pp. 38-45). IEEE.
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...cscpconf
In today’s world, a gigantic amount of data is available in science, industry, business and many
other areas. This data can provide valuable information which management can use for
making important decisions. The problem is how to find that valuable information. The answer
is data mining. Data mining is a popular topic among researchers, and much of it
remains unexplored. This paper focuses on a fundamental concept of data mining, namely classification techniques. The BayesNet, NaiveBayes, NaiveBayes Updateable, Multilayer Perceptron, Voted Perceptron and J48 classifiers are used to classify a data set. The performance of these classifiers is analyzed in terms of Mean Absolute Error, Root Mean-Squared Error and the time taken to build the model, and the results are shown both statistically and graphically. The WEKA data mining tool is used for this purpose.
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
Data mining works to extract information known in advance from the enormous quantities of data which can lead to knowledge. It provides information that helps to make good decisions. The effectiveness of data mining in access to knowledge to achieve the goal of which is the discovery of the hidden facts contained in databases and through the use of multiple technologies. Clustering is organizing data into clusters or groups such that they have high intra-cluster similarity and low inter cluster similarity. This paper deals with K-means clustering algorithm which collect a number of data based on the characteristics and attributes of this data, and process the Clustering by reducing the distances between the data center. This algorithm is applied using open source tool called WEKA, with the Insurance dataset as its input
We have concentrated on a range of strategies, methodologies, and distinct fields of research in this article, all of which are useful and relevant in the field of data mining technologies. As we all know, numerous multinational corporations and major corporations operate in various parts of the world. Each location of business may create significant amounts of data. Corporate decision-makers need access to all of these data sources in order to make strategic decisions.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering& Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
Abstract: In many fields, such as industry, commerce, government, and education, knowledge discovery and data
mining can be immensely valuable to the subject of Artificial Intelligence. Because of the recent increase in
demand for KDD techniques, such as those used in machine learning, databases, statistics, knowledge acquisition,
data visualisation, and high performance computing, knowledge discovery and data mining have grown in
importance. By employing standard formulas for computational correlations, we hope to create an integrated
technique that can be used to filter web world social information and find parallels between similar tastes of
diverse user information in a variety of settings
A SURVEY ON DATA MINING IN STEEL INDUSTRIESIJCSES Journal
In industrial environments, a huge amount of data is generated and collected in databases and data warehouses from all involved areas, such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection, shutdown, customer relation management, and so on. Data mining has become a useful tool for knowledge acquisition in the industrial process of iron and steel making. Due to the rapid growth of data mining, various industries have started using data mining technology to search for hidden patterns, which can feed new knowledge back into the system and help design new models that enhance production quality, productivity, cost and maintenance. The continuous improvement of all steel production processes, with regard to avoiding quality deficiencies and the related improvement of production yield, is an essential task for a steel producer. Therefore, the zero-defect strategy is popular today, and several quality assurance techniques are used to maintain it. The present report explains the methods of data mining and describes its application in the industrial environment, especially in the steel industry.
In the information age, data turns to be the vital. Hence it is important to understand the data in order to face the future information challenges. This paper deals with the importance of data mining while explaining the concepts and life cycle involved. It extracts the basic gist of the topic presented in a user-friendly way. Further, in developing different stages of data mining followed by its extended application usage in practical business platform.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Republic of Iraq
Ministry of Higher Education & Scientific Research
Iraqi Commission for Computers and Informatics
Informatics Institute for Postgraduate Studies
Study of Association Rules' Visualization
Techniques
A Project
Submitted to the Informatics Institute
For Postgraduate Studies of the Iraqi Commission
For Computers and Informatics as a partial fulfillment of the
Requirements for the degree of Higher Diploma in Web Site
Technology in Computer Science
By
Mustafa S.Shaheed
Supervised by
Dr. Hussein K. Khafaji
Baghdad, Iraq
Feb 2011 1432
Acknowledgments
My first and deepest gratitude goes to ALLAH the
Almighty for his countless blessings, help, and
guidance.
I would like to express my deepest appreciation to
my supervisor Dr. Hussein K. Khafaji for his guidance,
helpful comments, and suggestions.
Supervisor's Certification
I certify that the project entitled "Comparative Study of
Association Rules' Visualization Techniques" was prepared under
my supervision at the Informatics Institute for Postgraduate
Studies in Iraqi Commission for Computers and Informatics as a
partial fulfillment of the requirements for the degree of Higher
Diploma in Web Site Technology in Computer Science.
Signature:
Name: Dr. Hussein K. Khafaji
Date: /2/2011
Examining Committee Certification
We certify that we have read this project, entitled "Comparative Study
of Association Rules' Visualization Techniques", and as an examining
committee, examined the student "Mustafa S. Shaheed" in its contents and
what is related to it, and that in our opinion it meets the standard of a project
for the Higher Diploma in Web Site Technology in Computer Science.
Signature
Name: Dr. Hussein K. Khafaji
Title:
Date: /2/2011
Supervisor
Approved by the Informatics Institute for Postgraduate Studies of the
Iraqi Commission for Computers and Informatics.
Signature
Name: Prof. Dr. Imad Hussain Al-Hussaini
Date: /10/2010
Dean of the Institute
Signature
Name: Dr.
Title:
Date: /2/2011
Chairman
Signature
Name: Dr.
Title:
Date: /2/2011
Member
Signature
Name: Dr.
Title:
Date: /2/2011
Member
Abstract
As computers are used in more and more areas, large volumes of data are
continuously collected and stored in databases. An important issue is
how to find useful information in these massive data.
Data mining, also known as knowledge discovery in databases, is a
research area that aims to extract implicit, understandable, previously unknown and
potentially useful information from data.
Association Rules are one of the most widespread data mining tools because
they provide valuable information for many application fields, in spite of their
mining difficulties.
The exploration of large data sets is an important but difficult problem.
Information visualization techniques can be useful in solving this problem.
Visual data exploration has a high potential and many applications.
Association Rules Visualization is emerging as a crucial step in the data
mining process in order to profitably use the extracted knowledge.
In this project, the most important techniques of association rule visualization are
studied. These techniques are used to present the association rules discovered from databases by
the algorithms developed for this purpose. The project identifies the strengths and
weaknesses of these techniques in order to reach the most appropriate technology to
solve the main drawback of Association Rules.
List of Contents
Title Page
Chapter One: Introduction 1
1.1 Introduction 2
1.2 Introduction to Data Mining 2
1.3 Introduction to Association Rule 3
1.4 Introduction to Functional Dependencies 4
1.4.1 Candidate Key 5
1.5 Aim of the study 6
Chapter Two: Data Mining And Functional Dependency 8
2.1 Introduction 9
2.2 Data Mining Overview 9
2.2.1 Data Mining Application 10
2.2.2 The process before Data Mining 10
2.2.3 Data Mining tasks 11
2.2.3.1 Association Rules 12
2.2.3.2 Apriori algorithm 15
2.3 Functional depe 16
2.3.1 Definition (1) 17
2.3.2 Definition (2) 18
2.3.3 Multi Valued Dependencies 23
2.4 Candidate Keys 24
2.5 Primary Key 25
2.6 Super key 26
2.7 Armstrong's Axioms 27
Chapter Three: Proposed System To Determine the Candidate Keys 31
3.1 Introduction 32
3.2 The relation between data mining and functional dependency 32
3.3 An Algorithm of determining closure sets 32
3.4 System Architecture 34
3.4.1 Sets Generator 35
3.4.2 Candidate key tester 36
3.5 Set closure producer 42
3.6 key filter 46
3.7 Candidate keys system execution 47
Chapter four: Discussion, and Future works 52
4.1 Discussion 53
4.2 Future works 54
List of algorithms
Algorithm (3-1) testing the closure of sets of attributes algorithm 33
Algorithm (3-2) Rule testing algorithm 43
Algorithm (3-3) Closure generator algorithm 44
List of programs
Program (3-1) Candidate key tester 41
Program (3-2) Candidate key function 42
Program (3-3) merge program 45
List of Figures
Figure (3-1) the architecture of generating candidate keys 34
Figure (3-2) the main view of application 47
Figure (3-3) the interface of set generator 48
Figure (3-4) the interface of candidate key tester 49
Figure (3-5) the interface of table (sets) 50
Figure (3-6) the interface of table (candid) 51
List of tables
Table (2.1) A database with 4 items and 5 transactions 12
Table (2.2) How employees get to work 19
Table (2.3) Functional Dependencies defined over two sets 20
Table (2.4) Employees information 21
Table (2.5) Students information 22
Table (2.6) Managers phone# 23
Table (2.7) Manager- employee 23
Table (2.8) Relation of Managers, phone, and employee 24
Table (3.1) Sets stored table 36
Table (3.2) Candidate keys stored table 37
Table (3.3) Temporary values stored table 37
13. Introduction
2
Chapter 1
Chapter one
Introduction
1.1 Overview
Recent years have seen an enormous increase in the amount of
information stored in electronic format. It has been estimated that the
amount of collected information in the world doubles every 20 months,
that the size and number of databases are increasing even faster, and that the
ability to rapidly collect data has outpaced the ability to analyze it.
Information is crucial for decision making, especially in business
operations. As a response to those trends, the term 'Data Mining' (or
'Knowledge Discovery') has been coined to describe a variety of
techniques to identify nuggets of information or decision-making
knowledge in bodies of data, and to extract these in such a way that they
can be put to use in areas such as decision support, prediction,
forecasting and estimation. Automated tools must be developed to help
extract meaningful information from a flood of information. Moreover,
these tools must be sophisticated enough to search for correlations
among the data unspecified by the user, as the potential for unforeseen
relationships to exist among the data is very high. A successful tool set
to accomplish these goals will locate useful nuggets of information in
the otherwise chaotic data space, and present them to the user in a
contextual format.
Knowledge discovery in databases (KDD) is a new field
that draws on ideas from statistics, machine learning, databases, parallel
computing, computer graphics, data visualization, and other fields. KDD
systems generally use methods, algorithms, and techniques from all of
these fields. It has emerged due to the extraordinary growth of
data in all areas of human activity, the inability of database
management systems (DBMS) to extract hidden knowledge from databases,
and the economic and scientific need for such knowledge. KDD
includes techniques and tools to address this need.
Reference [27] defines knowledge discovery in databases as follows:
"KDD is the non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in the data".
Much of the literature uses the terms data mining (DM) and KDD
interchangeably and regards them as synonymous. At the first
international KDD conference in Montreal in 1995, it was proposed that
the term "KDD" be employed to describe the whole process of
extracting knowledge from data. It was further proposed that the term
'data mining' should be used exclusively for the discovery stage of the
KDD process. A more or less official definition of DM is the process of
automatic extraction of novel, useful, and understandable patterns
from large databases[20,21]. Hence, KDD
includes several steps: Focussing, Preprocessing,
Transformation, Data Mining and Evaluation. Figure (1.1) abstracts
the KDD process[14].
1- Focussing :- define the goal of the particular KDD task.
2- Preprocessing :- specified data has to be integrated.
3- Transformation :- assure that each data object is represented in a
common form which is suitable as input in the next step.
4- Data Mining :- detect the desired patterns contained within the
given data.
5- Evaluation :- the user evaluates the extracted patterns with
respect to the task defined in the focussing step.
Data mining is the most important step within the KDD
process. Reference [27] defines data mining as follows:
Data mining is a step in the KDD process consisting of applying data
analysis and discovery algorithms that, under acceptable computational
efficiency limitations, produce a particular enumeration of patterns over
the data.
According to this definition, data mining is the step that is responsible
for the actual knowledge discovery. Data mining comprises many tasks,
such as Association Rules (AR), Sequential Patterns, Classification,
Clustering, and Similarity Search.
Association rule (AR) mining is one of the most important tasks of DM. ARs represent the
correlation between sets of items in a transaction database. An AR is an
implication of the form:
X → Y (c%)
where X and Y are sets of items, each of which is called an
itemset. X is called the antecedent, while Y is called the consequent, such that
X ∩ Y = ∅, and c% is the confidence of the implication. For example,
the following rule
{"The love in cholera era", "The Merchant of Venice", "Zoorba"} → {"The Trees and Marzooq's Association", "One Hundred Years of Segregation"} (60%)
means that a person who reads the novels "The love in cholera
era", "The Merchant of Venice", and "Zoorba" also reads the novels
"The Trees and Marzooq's Association" and "One Hundred Years of
Segregation", with a certainty factor of 60%. The confidence of a rule is
calculated as follows:
Confidence = support (X∪Y)/support (X),
where the support of an itemset is the number of its occurrences in the
database. A confident rule is a rule whose confidence is greater than or equal to
the user-defined threshold called minimum confidence, minconf.
The ARs are extracted from mined frequent itemsets. Mining of
frequent itemsets is a very complex process[3]. The mining of association
rules consists of two steps: the first is mining the frequent itemsets,
while the second is extracting the rules from these frequent itemsets.
The first step, an intermediate step, is computationally massive and has
attracted the interest of researchers; over many years, many
algorithms have been produced to accomplish this complicated mining
process, such as Apriori, AprioriTID, AprioriHybrid [20], FP-growth
[12], and CHARM [17]. The second step extracts the association
rules from the results of the first step.
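The support and confidence definitions given above, and the second step of extracting rules from a frequent itemset, can be illustrated with a minimal Python sketch. The toy transactions, the item names, and the function names (`support`, `confidence`, `rules_from_itemset`) are assumptions introduced only for illustration; they are not taken from the project, and the sketch does not implement Apriori or any of the cited mining algorithms.

```python
from itertools import combinations

# Hypothetical toy transaction database (each transaction is a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def confidence(x, y, db):
    """Confidence of X -> Y, i.e. support(X U Y) / support(X).

    Assumes X occurs in at least one transaction, which holds when X
    is a subset of a frequent itemset.
    """
    return support(x | y, db) / support(x, db)


def rules_from_itemset(itemset, db, minconf):
    """Yield (X, Y, confidence) for each split of itemset into X -> Y
    whose confidence is at least minconf."""
    items = frozenset(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(sorted(items), size):
            x = frozenset(antecedent)
            y = items - x
            c = confidence(x, y, db)
            if c >= minconf:
                yield x, y, c


if __name__ == "__main__":
    for x, y, c in rules_from_itemset({"bread", "milk"}, transactions, 0.7):
        print(sorted(x), "->", sorted(y), f"confidence={c:.2f}")
```

Even in this small sketch, every non-empty split of an itemset is a candidate rule, so an itemset with k items yields 2^k - 2 candidates.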
The main drawback of Association Rules is thus the huge number
of extracted rules, which cannot be inspected manually, together with the
existence of trivial or meaningless associations that are usually mined
due to the exhaustive nature of the extraction algorithms[24]. Graphical
tools and pruning methods are the main approaches used to face these
problems. To make data mining effective and well evaluated, it
is important to include the human in the data exploration process and
combine the flexibility, creativity, and general knowledge of the human
with the enormous storage capacity and the computational power of
today’s computers. Visual data exploration aims at integrating the
human in the data exploration process, applying human perceptual
abilities to the analysis of large data sets available in today’s computer
systems. The basic idea of visual data exploration is to present the data
in some visual form, allowing the user to gain insight into the data, draw
conclusions, and directly interact with the data. Visual data mining
techniques have proven to be of high value in exploratory data analysis,
and have a high potential for exploring large databases. Visual data
exploration is especially useful when little is known about the data and
the exploration goals are vague. Since the user is directly involved in the
exploration process, shifting and adjusting the exploration goals is
automatically done if necessary. There are many techniques used to
visually represent data; some of them are discussed in this project.
Figure (1-1) Visualization and Data Mining
1.2 Aim of the project
The aim of the project is to study the techniques used to
present the association rules discovered from databases by the
algorithms developed for this purpose, and to identify the strengths and
weaknesses of these techniques in order to
reach the most appropriate technology to solve the main drawback of
Association Rules.
1.3 Project Outline
Chapter two explains the stages of Knowledge Discovery in
Databases (KDD) and the tasks of data mining, and concentrates on
association rules (AR).
Chapter three focuses on the concept of visualization, its
benefits, and the visualization techniques used to visualize
association rules (AR), which are the main subject of
this study.
Chapter four presents the summary and future work on the
techniques used to visualize association rules.
20. Data mining and Association Rules
9
Chapter 2
Chapter Two
Data mining and Association Rules
2.1 Introduction
This chapter presents the general steps of Knowledge discovery
in databases (KDD) and its relation with data mining. Also, it presents
the tasks of data mining (DM) and concentrates on Association rules
due to their importance as an interesting field of DM.
2.2 Knowledge Discovery in Databases
In recent years the amount of data that is collected by advanced
information systems has increased tremendously. Although very useful
information of strategic importance is buried within this data, this
information is not readily available to users. To analyze these huge
amounts of data, the interdisciplinary field of Knowledge Discovery in
Databases (KDD) has emerged. It applies efficient algorithms to extract
interesting patterns and regularities from the data.
KDD is defined as follows[27] :
Knowledge Discovery in Databases is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data.
According to this definition, data is a set of facts that is somehow
accessible in electronic form. The term patterns indicates models and
regularities which can be observed within the data. Patterns have to be
valid, i.e. they should be true on new data with some degree of certainty.
A novel pattern is not previously known or trivially true. The potential
usefulness of patterns refers to the possibility that they lead to an action
providing a benefit.
A pattern is understandable if it is interpretable by a human user.
Finally, KDD is a process, indicating that there are several steps that are
repeated in several iterations.
Figure 2.1 displays the process of KDD in its basic form.
Figure (2-1) The KDD process
2.3 KDD Process Stages
The KDD process is an interactive and iterative multi-step process
which uses five steps to extract interesting knowledge according to
some specific measures and thresholds [14]:
1- Focussing
2- Preprocessing
3- Transformation
4- Data Mining
5- Evaluation
2.3.1 Focussing
The first step is to define the goal of the particular KDD task.
Another important aspect of this step is to determine the data to be
analyzed and how to obtain it.
2.3.2 Preprocessing
In this step the specified data has to be integrated, because it is not
necessarily accessible on the same system. Furthermore, several objects
may be described incompletely. Thus, the missing values need to be
completed and inconsistent data should be corrected or left out.
2.3.3 Transformation
The transformation step has to assure that each data object is
represented in a common form which is suitable as input in the next step.
2.3.4 Data Mining
Data mining is the application of efficient algorithms to detect the
desired patterns contained within the given data. Thus, the data mining
step is responsible for finding patterns according to the predefined task.
Since this step is the most important within the KDD process, it is
examined more closely in Section 2.4.
2.3.5 Evaluation
Finally, the user evaluates the extracted patterns with respect to the
task defined in the focussing step. An important aspect of this evaluation
is the representation of the found patterns. Depending on the given task,
there are several quality measures and visualizations available to
describe the result. Representing the results of the KDD process with
visualization techniques is an important phase, because these techniques
allow the user to assess the results more easily and flexibly. If the user
is satisfied with the quality of the patterns, the process is terminated.
However, in most cases the results might not be satisfying after only one
iteration. In those cases, the user might return to any of the previous
steps to achieve more useful results.
2.4 Data Mining
Since data mining is the most important step within the KDD
process, we treat it more carefully in this section. In [27, 30] data
mining is defined as follows:
Data mining is a step in the KDD process consisting of applying
data analysis and discovery algorithms that, under acceptable
computational efficiency limitations, produce a particular enumeration
of patterns over the data.
According to this definition, data mining is the step that is responsible
for the actual knowledge discovery. Because data mining algorithms need
to process large amounts of data, the definition emphasizes that the
desired patterns have to be found under acceptable computational
efficiency limitations. Note that there are many other definitions of data
mining and that the terms data mining and KDD are often used
synonymously.
Data mining has many tasks such as:
1- Association Rules (AR): Given a database of transactions, where each
transaction consists of a set of items, association discovery finds all the
itemsets that frequently occur together, and also the rules among them.
Association rules are examined more closely in Section 2.5.
2- Sequential Patterns: Sequence discovery aims at extracting sets of
events that commonly occur over a period of time.
3- Classification and Regression: Classification aims to assign a new data
item to one of several predefined categorical classes. The goal of
classification and regression is to build a model that minimizes the error
between the predicted and true values of the target variable [15,18].
This task is known as supervised induction [14]. Supervised induction is
the machine learning task of inferring a function from supervised training
data [30].
4- Clustering: Clustering is the process of grouping the data records into
meaningful subclasses (clusters) in a way that maximizes the similarity
within clusters and minimizes the similarity between two different
clusters [10]. Clustering is also called unsupervised induction [3].
5- Similarity search: Similarity search is performed on a database of
objects to find the object(s) that are within a user-defined distance from
the queried object, or to find all pairs within some distance of each other.
Figure (2-2) Classification separates the data space (left) and clustering
groups data objects (right)
2.5 Association Rule
Association rules are one of the promising aspects of data mining
as a knowledge discovery tool, and have been widely explored to
date [27,14]. They make it possible to capture all rules that explain the
presence of some attributes according to the presence of other attributes.
An association rule, X ⇒ Y, is a statement of the form "for a specified
fraction of transactions, a particular value of an attribute set X
determines the value of attribute set Y as another particular value under a
certain confidence". Thus, association rules aim at discovering the
patterns of co-occurrences of attributes in a database. For instance, an
association rule in supermarket basket data may be "In 10% of
transactions, 85% of the people buying milk also buy milky-sweets
in that transaction"; here 10% is the support of the rule and 85% is its
confidence. Association rules may be useful in many
applications such as supermarket transactions analysis, store layout and
promotions on the items, telecommunications alarm correlation,
university course enrollment analysis, customer behavior analysis in
retailing, catalog design, word occurrence in text documents, stock
transactions, etc[29,21,16].
Let I = {I1,..., Im} be a set of literals, called items. Let D be a set of
transactions, where each transaction T is a set of items such that T ⊆ I,
and each transaction is associated with a unique identifier called TID.
Definition 2.1 An itemset X is a set of items in I. An itemset X is called a
k-itemset if it contains k items from I.
Definition 2.2 A transaction T satisfies an itemset X if X ⊆ T. The
support of an itemset X in D, support_D(X), is the number of transactions
in D that satisfy X.
Definition 2.3 An itemset X is called a large itemset if the support of X
in D exceeds a minimum support threshold explicitly declared by the
user, and a small itemset otherwise.
Definition 2.4 The negative border of a set S ⊂ P(R), closed with
respect to the set inclusion relation, is the set of minimal itemsets X ⊂ R
not in S. The negative border of the set of large itemsets is the set of
itemsets that are generated as candidates but fail to qualify for the set
of large itemsets.
Definition 2.5 An association rule is an implication of the form X ⇒ Y,
where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. X is called the antecedent of the
rule, and Y is called the consequent of the rule. The rule X ⇒ Y holds in
D with confidence c, where c = support_D(X ∪ Y) / support_D(X). The rule
X ⇒ Y has support s in D if the fraction s of the transactions in D
contain X ∪ Y.
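As a minimal sketch of Definitions 2.2 and 2.5, support and confidence can be computed directly from a list of transactions. The function names and the four toy transactions below are illustrative assumptions and are not taken from the study.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support_count(itemset: Iterable[str], db: List[Transaction]) -> int:
    """Number of transactions in db that satisfy (contain) the itemset."""
    x = frozenset(itemset)
    return sum(1 for t in db if x <= t)


def support(itemset: Iterable[str], db: List[Transaction]) -> float:
    """Fraction of the transactions in db that contain the itemset."""
    return support_count(itemset, db) / len(db)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               db: List[Transaction]) -> float:
    """support_D(X ∪ Y) / support_D(X) for the rule X => Y."""
    x, y = frozenset(antecedent), frozenset(consequent)
    if x & y:
        raise ValueError("antecedent and consequent must be disjoint")
    # Assumes the antecedent occurs at least once, otherwise this divides by zero.
    return support_count(x | y, db) / support_count(x, db)


# Hypothetical toy database of four transactions (an assumption for illustration).
db = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}, {"milk"},
)]
print(support({"milk", "bread"}, db))       # support of the rule milk => bread
print(confidence({"milk"}, {"bread"}, db))  # confidence of the rule milk => bread
```

The support of a rule is taken here as a fraction of all transactions, while the support count of Definition 2.2 is kept as a separate helper.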
Example: Suppose I = {A, B, C, D, E} is the set of abbreviations of movie
titles in a movie-CD shop; these abbreviations are shown in Table (2.1).
Table (2.2) represents a database of the shop's sales. Each transaction is
identified by a transaction identifier, TID. Table (2.3) shows the frequent
itemsets according to minsup = 33%, while Table (2.4) lists association
rules that satisfy minconf = 100%.
Table (2.1) The item abbreviations of the database
Abbreviation   Item
A              Golden mountain
B              Gone with the Wind
C              Zoorba
D              Rain Man
E              Sound of Music
Table (2.2) The transaction database of the shop
Transaction TID (Person)   Items (Attributes)
1 B,C,E
2 B,C,D,E
3 A,B,C,D,E
4 B,C,D
5 A,B,F
6 A,B,C,E
Table (2.3) Large itemsets with minsup = 33% (2 transactions)
Support    Itemsets                                             No.
6 = 100%   B                                                    1
5 = 83%    C, BC                                                2
4 = 67%    E, BE, CE, BCE                                       4
3 = 50%    A, D, AB, BD, CD, BCD                                6
2 = 33%    AC, AE, DE, ABC, ABE, ACE, BDE, CDE, ABCE, BCDE      10
Table (2.4) Association rules with minconf = 100%
A→B (3/3)   AC→B (2/2)   AC→BE (2/2)
C→B (5/5)   AE→B (2/2)   AE→BC (2/2)
D→B (3/3)   AC→E (2/2)   DE→BC (2/2)
E→B (4/4)   AE→C (2/2)   ABC→E (2/2)
D→C (3/3)   DE→B (2/2)   ABE→C (2/2)
E→C (4/4)   DE→C (2/2)   ACE→B (2/2)
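The large itemsets of the example can be recomputed with a brute-force sketch that counts every combination of items against the transactions of Table (2.2). This exhaustive enumeration is only an illustration that is feasible for six items; it is an assumption made for clarity and is not one of the mining algorithms discussed in this chapter.

```python
from itertools import combinations

# Transactions copied from Table (2.2): TID -> items.
DB = {
    1: {"B", "C", "E"},
    2: {"B", "C", "D", "E"},
    3: {"A", "B", "C", "D", "E"},
    4: {"B", "C", "D"},
    5: {"A", "B", "F"},
    6: {"A", "B", "C", "E"},
}
MIN_COUNT = 2  # minsup = 33% of the 6 transactions


def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in DB.values() if set(itemset) <= items)


items = sorted(set().union(*DB.values()))
frequent = {}  # itemset written as a string, e.g. "BCE" -> support count
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        c = count(combo)
        if c >= MIN_COUNT:
            frequent["".join(combo)] = c

# Group the large itemsets by support count, as in Table (2.3).
by_support = {}
for name, c in frequent.items():
    by_support.setdefault(c, []).append(name)
for c in sorted(by_support, reverse=True):
    print(c, by_support[c])
```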
The mining of association rules is decomposed into two
sub-problems:
1- Discovering all frequent (large) patterns, represented by the large
itemsets defined above; and
2- Generating the association rules from those frequent itemsets.
The first sub-problem is tedious, I/O intensive, and computationally
expensive for very large databases, which is the case for many real-life
applications. In large retailing data, the number of transactions is
generally in the order of millions, and the number of items (attributes)
is generally in the order of thousands. When the data contains N items,
the number of candidate large itemsets is 2^N. There are many algorithms
to mine frequent itemsets, such as Apriori, AprioriTID, and
AprioriHybrid [12]. The second sub-problem is straightforward and can be
solved efficiently in a reasonable time with a well-known algorithm. The
databases of frequent itemsets and ARs are assumed to be available in
this thesis; therefore there is no focus on any frequent itemset or AR
mining algorithm.
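The second sub-problem can be sketched as follows, assuming the support counts of the large itemsets are already available. The function name `generate_rules` is hypothetical, and the input is restricted to the large itemsets over A, B, and C from Table (2.3) to keep the example short; it is a simple illustration rather than the well-known algorithm referred to above.

```python
from itertools import combinations

# Support counts of the large itemsets over {A, B, C}, taken from Table (2.3).
SUPPORT = {
    frozenset("A"): 3, frozenset("B"): 6, frozenset("C"): 5,
    frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 5,
    frozenset("ABC"): 2,
}


def generate_rules(support, minconf):
    """Return (antecedent, consequent, confidence) for every rule X => Z - X.

    Every non-empty proper subset X of a large itemset Z is itself large,
    so its support count can be looked up in the same dictionary.
    """
    rules = []
    for z, sup_z in support.items():
        for k in range(1, len(z)):
            for x in combinations(sorted(z), k):
                x = frozenset(x)
                conf = sup_z / support[x]
                if conf >= minconf:
                    rules.append((x, z - x, conf))
    return rules


for x, y, conf in generate_rules(SUPPORT, minconf=1.0):
    print("".join(sorted(x)), "->", "".join(sorted(y)), conf)
```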
31. Visualization Techniques of Association Rules
20
Chapter 3
Chapter Three
Visualization Techniques of Association Rules
3.1 Introduction
This chapter presents the concept of visualization, the benefits of
visualization, and the visualization techniques used to visualize
association rules (AR) in the KDD process.
3.2 Visualization
Visualization is the process of transforming data, information,
and knowledge into visual form, making use of humans' natural visual
capabilities [9]. A typical application of visualization is the field of
computer graphics. The invention of computer graphics may be the most
important development in visualization since the invention of central
perspective in the Renaissance period. The development of animation
also helped advance visualization. In spite of the importance of
visualization, there are many limitations and difficulties that must be
taken into consideration [28, 4]. The main limitations are:
• Visualization techniques are always difficult to evaluate.
• The implementation may require the use of an operating system from
one specific vendor.
• The visualization techniques offered are very limited.
• Many 3D visualizations may waste screen space towards the corners
of the screen.
• The traditional menu bar approach would require long mouse
movements from the visualization to the menu bar and vice versa.
• Object interaction becomes complex within a 3D environment; for
example, the user can transform the parallel bar chart into a matrix
format and vice versa.
3.3 Benefits of Visualization
Visual data exploration can be seen as a hypothesis generation
process: the visualizations of the data allow the user to gain insight into
the data and come up with new hypotheses. The verification of the hypotheses
can also be done via data visualization, but may also be accomplished by
automatic techniques from statistics, pattern recognition, or machine
learning. In addition to the direct involvement of the user, the main
advantages of visual data exploration over automatic data analysis
techniques are:
• Visual data exploration can easily deal with highly non-homogeneous
and noisy data.
• Visual data exploration is intuitive and requires no understanding of
complex mathematical or statistical algorithms or parameters.
• Visualization can provide a qualitative overview of the data, allowing
data phenomena to be isolated for further quantitative analysis.
As a result, visual data exploration usually allows a faster data
exploration and often provides more interesting results, especially in
cases where automatic algorithms fail. In addition, visual data
exploration techniques provide a much higher degree of confidence in
the findings of the exploration. These facts lead to a high demand for
visual exploration techniques and make them indispensable in
conjunction with automatic exploration techniques [6].
3.4 Visualization of Association Rule
Visualizing association rules aims at solving some major
problems that come with association rules. First of all, the rules found by
automatic procedures must be filtered: depending on the minimum
confidence and minimum support that are specified, a vast number of
rules may be generated.
There are at least five parameters involved in a visualization of
association rules [19].
· Sets of antecedent items.
· Sets of consequent items.
· Associations between antecedent and consequent.
· Rules' support.
· Rules' confidence.
The goal of association rule generation is to find interesting patterns
and trends in transaction databases. Association rules are statistical
relations between two or more items in the data set. In a supermarket
basket application, associations express the relations between items that
are bought together. It is, for example, interesting if we find out that in
70% of the cases when people buy bread, they also buy milk.
Association rules tell us that the presence of some items in a transaction
implies the presence of other items in the same transaction with a certain
probability, called confidence. A second important parameter is the
support of an association rule, which is defined as the percentage of
transactions in which the items co-occur.
Let I = {i1, ..., in} be a set of items and let D be a set of transactions,
where each transaction T is a set of items such that T ⊆ I. An association
rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X, Y ≠ ∅.
The confidence c is defined as the percentage of the transactions
containing X that also contain Y. The support is the percentage of
transactions that contain both X and Y. For a given support and
confidence level, there are efficient algorithms to determine all
association rules. A problem, however, is that the resulting set of
association rules is usually very large, especially for low support and
confidence levels [8,9]. Using higher support and confidence levels may
not be effective, since useful rules may then be overlooked. Pattern
visualization techniques have been used to overcome this problem and to
allow an interactive selection of good support and confidence levels.
Figure (3.1) shows SGI MineSet's Rule Visualizer [14], which maps the left
and right hand sides of the rules to the x- and y-axes of the plot,
respectively, and shows the confidence as the height of the bars and the
support as the height of the discs. The color of the bars shows the
interestingness of the rule.
Figure (3.1) MineSet's Association Rule Visualizer
Using the visualization, the user is able to see groups of related rules and
the impact of different confidence and support levels. The goal of
association rule visualization is to visualize a large number of
association rules and their metadata in a two-dimensional (2D) or
three-dimensional (3D) display with minimum human interaction,
minimum occlusion, and no screen swapping. Many approaches have been
developed to visualize association rules, including:
1- Rule Table
2- Two-Dimensional Matrix
3- Directed Graph
4- Rule-to-Item approach
5- Parallel Coordinates
6- Mosaic Plot
7- Double Decker Plot
8- Many-to-Many AR Visualization Technique
3.4.1 Rule Table
The most straightforward method for the association rule
visualization is to use the rule table. The following rule table format has
been used [26]:
Item 1  Item 2  Item 3  Item 4  Item 5  Item N  Rule N  Antecedent N  Confidence  Support
Here Item 1, Item 2, ..., Item 5 are the first five items of the rule,
Rule N is the number of items in the rule, and Antecedent N is the number
of items in the rule antecedent. The number of items in the consequent is
therefore Rule N - Antecedent N.
Table (3.1) Example of Association Rules in Rule Table Format
Item 1  Item 2  Item 3  Item 4  Item 5  Item N  Rule N  Antecedent N  Confidence  Support
Bread   Milk    Null    Null    Null    Null    2       1             90%         10%
Eggs    Bread   Milk    Null    Null    Null    3       1             85%         7%
Milk    Bread   Eggs    Olive   Null    Null    4       2             60%         3%
In Table (3.1), rule #3 (the third row), the column Rule N = 4 means that
the rule consists of 4 items, and Antecedent N = 2 means that there are
2 items in the rule antecedent. The row therefore represents the rule
Milk, Bread → Eggs, Olive with confidence 60% and support 3%.
The rule table is the most straightforward way to show association
rules to the users. However, the rule table is only suitable for displaying
a limited number of rules. If the user needs to have a global
view of all the rules, the rule table is not a suitable approach.
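The rule table format can be sketched as a small record type. The class name `RuleRow` and the method `split` are hypothetical, and the sketch assumes that the antecedent items are stored first in the item columns, as the third row of Table (3.1) suggests.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RuleRow:
    items: List[Optional[str]]  # Item 1 .. Item N columns, None stands for Null
    rule_n: int                 # number of items in the rule
    antecedent_n: int           # number of items in the antecedent
    confidence: float           # in percent
    support: float              # in percent

    def split(self) -> Tuple[List[str], List[str]]:
        """Return the (antecedent, consequent) item lists of the row."""
        used = [i for i in self.items if i is not None][: self.rule_n]
        return used[: self.antecedent_n], used[self.antecedent_n:]


# Rows copied from Table (3.1).
table = [
    RuleRow(["Bread", "Milk", None, None, None, None], 2, 1, 90, 10),
    RuleRow(["Eggs", "Bread", "Milk", None, None, None], 3, 1, 85, 7),
    RuleRow(["Milk", "Bread", "Eggs", "Olive", None, None], 4, 2, 60, 3),
]

# Sorting by a column of interest, here by descending support.
for row in sorted(table, key=lambda r: r.support, reverse=True):
    lhs, rhs = row.split()
    print(", ".join(lhs), "->", ", ".join(rhs),
          f"conf={row.confidence}% sup={row.support}%")
```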
3.4.2 Two-Dimensional Matrix
The design of a two-dimensional (2D) association matrix
positions the antecedent and consequent items on separate axes of a
square matrix. Customized icons are drawn on certain matrix tiles that
connect the antecedent and the consequent items of the corresponding
association rules. Different icons can be used to depict different
metadata such as the support and confidence values of the rules. Figure
(3.2) depicts an association rule (B→C). Both the height and the color of
the column icon can be used to present metadata values. The values of
support and confidence are mapped to 3D columns that are built
separately on and beneath the matrix tiles. Other icons such as disks and
bars are also used to visualize metadata in the rule visualizer of MineSet
[4,22,28]. A 2D matrix is arguably the most effective technique to show
one-to-one binary relationships.
• The strengths of a 2D matrix, however, break down when we need to
visualize many-to-one relationships such as association rules with
multiple antecedent items. For example, in Figure (3.3) it is almost
impossible to tell whether there is only one association rule (A+B→C) or
two (A→C and B→C).
• The lack of a practical way to identify the togetherness of individual
antecedent items makes a 2D matrix a weaker candidate to visualize
rules with multiple antecedent items. MineSet [23] addresses the problem
by grouping all the antecedent items of an association rule as one unit
and plotting it against its consequent, i.e., an antecedent-to-consequent
plot. For example, a dedicated item group (A+B) is created in Figure
(3.4) to describe the association rule (A+B→C).
Figure (3.2) The colored column indicates the association
rule (B →C). Different icon colors are used to show
different metadata values of the association rule
• The strategy works fine for smaller antecedent sets (e.g., fewer than
3 items). In our text mining studies, we encounter association rules with
as many as 12 items in the antecedent.
• The replication of items in the antecedent groups creates a much larger
antecedent-to-consequent plot when compared with the corresponding
item-to-item plot.
The loss of item identity within an antecedent group also defeats the
purpose of visualizing the associations with a matrix. For example, the
row (or column) of the matrix connected to an item can no longer be
used to search for all the rules involving that item.
Figure (3.3) It is very difficult to determine the difference
between (A+B→C) and (A→C and B→C)
Figure (3.4) The identities of A and B are lost in the
new item group that was created to depict the
association rule (A+B→C).
• Another problem in a 2D matrix display is object occlusion, especially
when multiple icons are used to depict different metadata values on the
matrix tiles. The occlusion problem is obvious in Figure (3.5).
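The ambiguity of the item-to-item matrix for rules with several antecedent items can be illustrated with a short sketch. The function name `item_matrix` and the 0/1 encoding of the tiles are assumptions made for illustration; the sketch ignores icons and metadata.

```python
def item_matrix(rules, items):
    """Mark the tile (antecedent item, consequent item) for every rule."""
    index = {item: i for i, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for antecedent, consequent in rules:
        for a in antecedent:
            for c in consequent:
                matrix[index[a]][index[c]] = 1
    return matrix


items = ["A", "B", "C"]
one_rule = [({"A", "B"}, {"C"})]              # A+B -> C
two_rules = [({"A"}, {"C"}), ({"B"}, {"C"})]  # A -> C and B -> C

for row in item_matrix(one_rule, items):
    print(row)
# Both rule sets mark exactly the same tiles, so the matrix cannot tell them apart.
print(item_matrix(one_rule, items) == item_matrix(two_rules, items))
```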
Figure (3.5) Object occlusions are unavoidable.
Figure (3.6) Left: A →C and B →C. Right: A+B→C.
3.4.3 Directed Graph
A directed graph is another prevailing technique to depict item
associations. The nodes of a directed graph represent the items, and the
edges represent the associations. Figure (3.6) shows three association
rules (A→C, B→C, A+B→C).
• This technique works well when only a few items (nodes) and
associations (edges) are involved. An association graph can quickly turn
into a tangled display with as few as a dozen rules. Hetzler et al. [19]
address the problem by animating the edges to show the association of
certain items with 3D rainbow arcs. The animation technique requires
significant human interaction to turn the item nodes on and off. It is
also not an easy task to show multiple metadata values, including support
and confidence, alongside the association rules.
3.4.4 Rule-to-Item Visualization Technique
To visualize many-to-one association rules, instead of using the
tiles of a 2D matrix to show the item-to-item association rules, the
matrix of the rule-to-item relationship is used to depict many-to-one
rules [19]. In Figure (3.7) the rows of the matrix floor represent the items
(or topics in the context of text mining), and the columns represent the
item associations. The blue and red blocks of each column (rule)
represent the antecedent and the consequent of the rule. The identities of
the items are shown along the right side of the matrix. The confidence
and support levels of the rules are given by the corresponding bar charts
in different scales at the far end of the matrix. The rule-to-item
visualization approach has many advantages over all the other matrix-
based predecessors:
• There is virtually no upper limit on the number of items in an
antecedent. We can analyze the distributions of the association
rules (horizontal axis) as well as the items within (vertical axis)
simultaneously.
•Unlike Figure (3.4), the identity of individual items within an
antecedent group is clearly shown.
•No new antecedent groups are created because of the multiple
antecedent items in association rules.
•Because all the metadata are plotted at the far end and the height of the
columns is scaled so that the front columns do not block the rear ones,
few occlusions occur.
• No screen swapping, animation, or human interaction (other than basic
mouse zooming) is required to analyze the rules.
Although this technique is better than the other matrix-based ones, it
suffers from serious drawbacks, such as:
• It is unable to visualize many-to-many association rules.
• It suffers from antecedent-consequent interleaving, i.e. the items of
the antecedent and the consequent are interleaved, although they are
given different colors.
• Deterioration of the naturalness of the rule's parts sequence.
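The rule-to-item layout can be sketched as a character grid in which rows are items and columns are rules. The function name `rule_item_matrix`, the marker characters, and the three example rules are hypothetical; the grid stands in for the colored blocks of the actual display and omits the support and confidence bar charts.

```python
def rule_item_matrix(rules, items):
    """Rows are items, columns are rules: 'A' = antecedent, 'C' = consequent."""
    grid = []
    for item in items:
        row = []
        for antecedent, consequent in rules:
            if item in antecedent:
                row.append("A")
            elif item in consequent:
                row.append("C")
            else:
                row.append(".")
        grid.append(row)
    return grid


# Hypothetical many-to-one rules used only for illustration.
rules = [({"A", "B"}, {"C"}), ({"A"}, {"C"}), ({"B", "C", "D"}, {"A"})]
items = ["A", "B", "C", "D"]
for item, row in zip(items, rule_item_matrix(rules, items)):
    print(item, " ".join(row))
```

Because antecedent and consequent markers share the same rows, a consequent item can appear above, below, or between the antecedent items of a rule, which is the interleaving drawback listed above.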
Figure (3.7) A visualization of item associations with
support 0.4% and confidence 50%.
3.4.5 Parallel Coordinates
In parallel coordinates [1,2,13], the basic elements of association
rules are sets of items, which can be handled by listing all items along a
vertical coordinate. The resulting coordinate is then repeated evenly in
the horizontal direction until there are enough coordinates to host the
longest association rule. An association rule can be visualized as a
polygonal line connecting all items in the rule. Parameters such as
support and confidence can be mapped to graphical features such
as line width and color. Figure (3.8) illustrates an association rule ab →
cd as one polygonal line for its LHS, followed by an arrow connecting
another polygonal line for its RHS. This visualization handles nicely the
upward closure property of association rules: subsets of the RHS are
absorbed and are not displayed. For example, ab → cd implies that abc
→ d, abd → c, ab → c, and ab → d are valid association rules. The
implied association rules are not displayed. If two or more itemsets or
rules have parts in common, for example adbe and cdb in Figure (3.8),
their polygonal lines overlap.
Figure (3.8) association rule ab → cd in Parallel Coordinates
Visualization technique
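A minimal sketch of the layout step: each item gets a fixed vertical position, each successive item of a rule is placed on the next vertical axis, and the rule becomes two polylines joined by an arrow. The function name `rule_polyline` and the point encoding (axis index, item position) are assumptions made for illustration.

```python
def rule_polyline(antecedent, consequent, items):
    """Return (LHS points, RHS points); a point is (axis index, item position)."""
    y = {item: pos for pos, item in enumerate(items)}
    lhs = [(axis, y[item]) for axis, item in enumerate(antecedent)]
    offset = len(antecedent)  # the RHS continues on the axes after the LHS
    rhs = [(offset + axis, y[item]) for axis, item in enumerate(consequent)]
    return lhs, rhs


items = ["a", "b", "c", "d", "e"]
lhs, rhs = rule_polyline(["a", "b"], ["c", "d"], items)
print(lhs)  # polyline of the LHS of ab -> cd
print(rhs)  # polyline of the RHS; an arrow would join lhs[-1] to rhs[0]
```

The number of vertical axes needed for a set of rules is the length of the longest rule, which matches the repetition of the coordinate described above.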
3.4.6 Mosaic Plot
The basic idea is to partition a rectangle on the y-axis according to
one attribute and make the regions proportional to the sum of the
corresponding data values. The height of the bars, instead of the width,
is used to show the parameter value. Then each resulting area is split in
the same way according to a second attribute [13]. The coloring reflects
the percentage of data items that fulfill a third attribute. The
visualization shows the support and confidence values of all rules of the
form X1,X2 → Y (Figure (3.9)). Mosaic plots are restricted to two
attributes on the left side of the association rule [6].
Figure (3.9) X1,X2 → Y in Mosaic Plot
Figure (3.10) X1,X2 → Y in Double Decker Plot
3.4.7 Double Decker Plot
Double decker plots can be used to show more than two attributes
on the left side. The idea is to show a hierarchy of attributes on the
bottom (heineken, coke, chicken in the example shown in Figure (3.10))
corresponding to the left-hand side of the association rules. The bars
on the top correspond to the number of items in the corresponding subset
of the database and therefore visualize the support of the rule. The
colored areas in the bars correspond to the percentage of data
transactions that contain an additional item and therefore correspond to
the confidence [6,11].
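The quantities behind a double decker plot can be sketched numerically: one bar per presence/absence combination of the left-hand-side items, a bar size equal to the number of matching transactions, and a highlighted share equal to the fraction of those transactions that contain the additional item. The function name `double_decker_bars` is hypothetical, and the transactions are those of Table (2.2), reused here only as sample data.

```python
from itertools import product

# Transactions copied from Table (2.2).
DB = [set("BCE"), set("BCDE"), set("ABCDE"), set("BCD"), set("ABF"), set("ABCE")]


def double_decker_bars(db, antecedent_items, consequent_item):
    """Return one (combination, bar size, highlighted share) tuple per bar."""
    bars = []
    for combo in product([True, False], repeat=len(antecedent_items)):
        # Transactions whose presence/absence pattern matches this combination.
        subset = [t for t in db
                  if all((item in t) == present
                         for item, present in zip(antecedent_items, combo))]
        hits = sum(1 for t in subset if consequent_item in t)
        share = hits / len(subset) if subset else 0.0
        bars.append((combo, len(subset), share))
    return bars


for combo, size, highlighted in double_decker_bars(DB, ["A", "C"], "E"):
    print(combo, size, highlighted)
```

The bar for the combination (True, True) corresponds to the rule AC → E: its size is the number of transactions that contain both A and C, and its highlighted share is the confidence of the rule.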
3.4.8 Many to Many AR Visualization Technique
As previously mentioned, three of the approaches developed to
visualize association rules are the two-dimensional matrix, the directed
graph, and the rule-to-item approach. It was also shown that the
rule-to-item approach is the best of these techniques in spite of its
drawbacks, such as its inability to represent many-to-many ARs and the
interleaving of consequent and antecedent items in the visualization
area. This section presents a new technique which removes these
drawbacks: it avoids item interleaving and efficiently represents
many-to-many ARs. This technique is called the many-to-many AR
visualization technique, MARVT. In this technique the visualization area
is divided into three regions: the antecedent region, the statistical
region, and the consequent region.
This technique can be implemented in two or three dimensions. If
the 2D implementation is chosen, the x-axis of the visualization
area holds the rule identifiers, while the y-axis of the antecedent region
holds the items of the antecedents of the rules to be visualized. The
y-axis of the statistical region is divided according to the confidence
and support levels of the rules, while the y-axis of the consequent region
holds the items of the consequents of the selected rules. Figure (3.11)
depicts the general structure of the visualization area of the proposed
technique. If an item i belongs to the antecedent of a rule R, a red
ellipse is drawn at position (R, i) of the antecedent region, and if an
item j is part of the consequent of the rule R, a black ellipse is drawn
at position (R, j) of the consequent region. The statistical region
contains important statistical values such as the confidence, the
support, the support of the antecedent itemset, and the support of the
consequent itemset of each rule, in the area reserved for that rule. The
y-axis of the statistical region begins at the minsup and minconf
thresholds and ends at 100%. The technique is flexible enough to
visualize more statistical information, such as the support of each
item. It is also possible to display the order of the rule.
If this technique is implemented in three dimensions, the same regions
are used. The x-axis is determined by the rule id. The y-axis is
determined by the items of the antecedent and consequent for their
respective regions. The z-axis is determined by the support and
confidence, beginning at the minconf or minsup threshold.
The third dimension is used to show the support of the items, the
confidence and support of a rule, and the support of the antecedent
and consequent itemsets. With this technique it is possible to
visualize many-to-many, one-to-many, and many-to-one rules, because it
provides two separate regions for the antecedent and the consequent,
each of which can hold an unlimited number of items. This separation
also avoids item interleaving, because the items of the consequent and
the antecedent are presented in different regions.
Figure (3.11) General Structure of Visualization Area of
Proposed Many-to-Many Association Rules
Visualization Technique, MARVT .
To illustrate this technique, consider the following rules:
1- a,b→c,q1 with confidence 63 and support 2.
2- a,b,c→q1,m with confidence 100 and support 3.
3- b,c→c,m,q1 with confidence 50 and support 1.
Figure (3.12) shows the hypothetical visualization of these rules. As
shown, the antecedent items of R1 are a and b; therefore, the positions
(R1, a) and (R1, b) of the antecedent area are marked with red ellipses,
and so on for the rest of the rules. Also, (R1, c) and (R1, q1) of the
consequent area are marked with black ellipses because c and q1 are the
consequent items of R1. The same process is done for R2 and R3. The
statistical area visualizes the support of the antecedent and consequent
itemsets and, furthermore, the support and confidence of the rules. It is
also possible to add the support of each item next to its ellipse. For
example, the number 3 beside the ellipse of the item a in R1 represents
the support of the item a, and so on for each item.
Figure (3.12) Visualization Area of
Many-to-Many Association Rules
Visualization Technique
Figure (3.13) depicts the general structure of MARVT. This structure
preserves the same regions: consequent, antecedent, and statistical.
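A text-only sketch of the 2D layout can make the three regions concrete. The function names `region` and `render`, the marker characters, and the top-to-bottom order of the regions are assumptions made for illustration; the rules are copied from the example above, and the statistical region is reduced to the confidence and support values.

```python
# Rules copied from the example: (antecedent, consequent, confidence, support).
RULES = [
    (["a", "b"], ["c", "q1"], 63, 2),
    (["a", "b", "c"], ["q1", "m"], 100, 3),
    (["b", "c"], ["c", "m", "q1"], 50, 1),
]


def region(rules, side, mark):
    """Return {item: row string}; one column per rule, marked where the item occurs."""
    items = sorted({item for rule in rules for item in rule[side]})
    return {item: " ".join(mark if item in rule[side] else "." for rule in rules)
            for item in items}


def render(rules):
    """Stack the antecedent, statistical, and consequent regions as text."""
    lines = ["antecedent region"]
    lines += [f"{item:>3} {row}" for item, row in region(rules, 0, "x").items()]
    lines.append("statistical region")
    lines.append("conf " + " ".join(str(r[2]) for r in rules))
    lines.append("sup  " + " ".join(str(r[3]) for r in rules))
    lines.append("consequent region")
    lines += [f"{item:>3} {row}" for item, row in region(rules, 1, "o").items()]
    return "\n".join(lines)


print(render(RULES))
```

Because each region has its own item rows, the markers of the antecedent and of the consequent never share a row, which is how the separation avoids interleaving.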
49. Conclusion
38
Chapter 4
Chapter Four
Summary and Future Work
4.1 Introduction
Chapter three presented the most important techniques for
visualizing association rules. This chapter summarizes these
techniques by reviewing their most important advantages and
disadvantages.
4.2 Summary
This section reviews the most important characteristics of the
techniques presented in the previous chapter.
4.2.1 Rule Table
1- Visualizes one-to-one, many-to-one, and many-to-many
relationships.
2- Results can be sorted by the column of interest.
3- Visualizes full details of the rule (antecedent, consequent, support,
confidence).
4- Displays only a limited number of rules.
5- Its main limitation is the close resemblance to the original raw
textual form, so that the user can inspect only a few rules without
having a global view of all the information.
6- Not interactive.
4.2.2 Two-Dimensional Matrix
1- Effective technique to show one-to-one binary relationships.
2- Breaks down when we need to visualize many-to-one and many-to-
many relationships.
3- Visualizes full details of the rule (antecedent, consequent,
support, confidence).
4- Object occlusion, especially when multiple icons are used to
depict different metadata values on the matrix tiles.
5- Limited number of rules.
6- Not interactive.
4.2.3 Directed Graph
1- Visualizes one-to-one and many-to-one relationships.
2- Displays only a limited number of rules.
3- Lacks a clear representation of the support and confidence.
4- Edges of different rules overlap with each other.
5- Not interactive.
4.2.4 Rule-to-Item Visualization Technique
1- Visualizes many-to-one relationships.
2- Breaks down when many-to-many relationships
must be visualized.
3- No upper limit on the number of items in an antecedent.
4- Clearly shows the individual items within an antecedent group.
5- No new antecedent groups are created because of multiple
antecedent items in association rules.
6- No object occlusion.
7- The natural sequence of the rule's parts deteriorates.
8- Items of the antecedent and consequent are interleaved,
although they are given different colors.
9- Interactive.
4.2.5 Parallel Coordinates
1- Visualizes one-to-one, many-to-one, and many-to-many relationships.
2- Shows the full details of each rule (antecedent, consequent, support,
confidence).
3- Visual rules overlap with each other.
4- Suffers from object occlusion.
5- Lacks a clear representation of the support and confidence, as shown in
figure (4.1).
Figure (4.1) Overlapping rules and the lack of a clear representation of the
support and confidence
4.2.6 Mosaic Plot
1- Visualizes one-to-one, many-to-one, and many-to-many relationships.
2- Restricted to two attributes on the left side of the association rule.
3- Visualizes one rule at a time.
4- Difficult to understand and implement.
5- Lacks a clear representation of the support and confidence.
4.2.7 Double Decker Plot
1- Visualizes one-to-one, many-to-one, and many-to-many relationships.
2- Shows more than two attributes on the left side.
3- Visualizes one rule at a time.
4- Lacks a clear representation of the support and confidence.
5- Difficult to understand and implement.
4.2.8 Many to Many AR Visualization Technique
1- The best technique for visualizing many-to-many relationships.
2- Shows the full details of each rule (antecedent, consequent,
support, confidence).
3- No object occlusion.
4- No upper limit on the number of items in an antecedent.
5- Clear representation of the support and confidence.
6- Interactive.
7- Flexible enough to visualize additional statistical information.
8- It is possible to display the order of the rule.
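The characteristics summarized in Section 4.2 can also be collected into a small lookup structure, which makes the comparison between techniques easy to query. The sketch below is only an illustration: the dictionary layout and the function name `candidates` are assumptions, and a value of None means that the summary above does not state whether the technique is interactive.

```python
# Values transcribed from the lists in Section 4.2.
# "interactive" is None where the summary does not state the property.
ALL = {"one-to-one", "many-to-one", "many-to-many"}

TECHNIQUES = {
    "Rule Table": {"relationships": set(ALL), "interactive": False},
    "Two-Dimensional Matrix": {"relationships": {"one-to-one"}, "interactive": False},
    "Directed Graph": {
        "relationships": {"one-to-one", "many-to-one"},
        "interactive": False,
    },
    "Rule-to-Item": {"relationships": {"many-to-one"}, "interactive": True},
    "Parallel Coordinates": {"relationships": set(ALL), "interactive": None},
    "Mosaic Plot": {"relationships": set(ALL), "interactive": None},
    "Double Decker Plot": {"relationships": set(ALL), "interactive": None},
    "Many to Many AR": {"relationships": {"many-to-many"}, "interactive": True},
}


def candidates(relationship, require_interactive=False):
    """Names of techniques listed as handling `relationship`.

    With require_interactive=True, only techniques explicitly listed as
    interactive are returned; unstated (None) entries are excluded.
    """
    return [
        name
        for name, props in TECHNIQUES.items()
        if relationship in props["relationships"]
        and (not require_interactive or props["interactive"] is True)
    ]


print(candidates("many-to-many", require_interactive=True))
```

Queried this way, the only technique listed as both interactive and suited to many-to-many rules is the Many to Many AR Visualization Technique, which agrees with the summary above.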
4.3 Future work
The exploration of large data sets is an important but difficult problem.
Information visualization techniques can be useful in solving this
problem. Visual data exploration has a high potential, and many
applications such as fraud detection and data mining can use information
visualization technology for improved data analysis.
Avenues for future work include the tight integration of
visualization techniques with traditional techniques from such
disciplines as statistics, machine learning, operations research, and
simulation. Integration of visualization techniques and these more
established methods would combine fast automatic data mining
algorithms with the intuitive power of the human mind, improving the
quality and speed of the data mining process. Visual data mining
techniques also need to be tightly integrated with the systems used to
manage the vast amounts of relational and semistructured information,
including database management and data warehouse systems. The
ultimate goal is to bring the power of visualization technology to every
desktop to allow a better, faster and more intuitive exploration of very
large data resources. This will not only be valuable in an economic sense
but will also stimulate and delight the user.
53. 42
References
[1] Alfred Inselberg, “Parallel Coordinates: Visual Multidimensional
Geometry and Its Application”, University of San Francisco, 2009.
[2] Alfred Inselberg, “Visualizing high dimensional datasets and
multivariate relations”, (tutorial).In: Proc. 6th
[4] B. Bustos, D. KeIrn, C. Panse, T Schreck, “ Pattern
Visualization",
ACMSIGKDD Inter. Conf. on
Knowledge Discovery and Data Mining (KDD 2000), Boston, MA (2000).
[3] Anil K. Jain and Richard C. Dubes, “Algorithms for Clustering Data”,
Prentice Hall, 1988.
wawTyniuk}@dbvis.infUlUkonslanz., 2003.
[5] Cheung D.W., Ng V., Fu A.W. and Fu Y., “Efficient Mining of
Association Rules in Distributed Databases”, Special Issue in ata
ining”,IEEE Transaction on Knowledge and Data Engineering, IEEE
Computer Society, 1996.
[6] Daniel Keim and Matthew Ward, “Visual Data MiningTechniques “,
University of Konstanz, Germany and Worcester Polytechnic Institute,
USA 2002.
[7] D. Bruzzese, C. Davino, “Visual Post-Analysis of Association Rules”,
Dept. of athematics and Statistics, University of Naples Federico, Italy,
{dbruzzes, cdavino !aunina.it, 2002.
54. 43
[8] D. Keim, "Designing fuel-Oriented Visualization Techniques” ,
University of Florida,,2000.
[9] Gershon N., Eick S. G., and Card S., “Information Visualization”, ACM
Interactions, vol. 5, no. 2, pp. 9-15, March/April 1998.
[10] G. Karypis and V. Kumar, “Scalable Parallel Data Mining for
Association Rules”, University Arizona,2000.
[11] H. Hofmann, A. Siebes, and A. Wilhelm, “Visualizing association
rules with interactive mosaic plots”, SIGKDD Int. Conf. On Knowledge
Discovery & Data Mining (KDD 2000), Boston, MA, 2000.
[12] J.Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate
generation”. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD’00, Dallas, TX, May 2000.
[13] Martin, A., Ward, M.O.: High dimensional brushing for interactive
exploration of multivariate data, In: Proc. IEEE Conf. on Visualization,
Atlanta,(1995).
[14] Matthias Schubert, “Advanced Data Mining Techniques for Compound
Objects”, Maximilians- University¨, 2004.
[15] M. Deshpande and G. Karypis. ”Evaluation of Techniques for
lassifying Biological equences”. Taipei, Taiwan2002.
[16] Michael Hahsler and Sudheer Chelluboina, “Visualizing Association
Rules: Introduction to theR-extension Package arulesViz”, Southern
Methodist University 2004.
55. 44
[17] M. J. Zaki and C. J. Hsiao. CHARM: “An efficient algorithm for closed
itemset mining”. In Proc. 2002 SIAM Int. Conf. Data Mining (SDM’02),
pages 457–473, Arlington, VA, April 2002.
[18] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar,” Introduction to
Data Mining”, University of Minnesota , 2005.
[19] P. C Wong, P. Whitney, J. Thomas, "Visualizing Anociation Rules for
Text Mining", Pacific Northwest National Laboratory, 2000.
[20] Rakesh Agrawal Ramakrishnan Srikant, “Fast Algorithms for Mining
Association Rules”, IBM Almaden Research Center 1994.
[21] Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami:” Mining
Association Rules between Sets of Items in Large Databases”. SIGMOD
Conference 1993.
[22] Redpath, B. Sriruvasan, "Criteria for Comparati"e Study of
VISualization Techniques in Data mining", IEEE 3..1 into Conf On
Intelligent System, Tulsa, USA, 2003.
[23] S. G. Inc. Mineset. http://www.sgi.com/software/mineset, 2001.
[24] Simeon J. Simoff, Michael H. Böhlen, “Visual Data Mining”,
University ofWestern Sydney,1998.
[25] Stefanos Manganaris. “Supervised Classification with Temporal Data”,
PhD thesis, School of Engineering, Vanderbilt University, 1997.
56. 45
[26] Thomas S., “Architectures and Optimizations for Integrating Data
Mining Algorithms with Database Systems”, Ph.D. dissertation, University
of Florida, Gainesville, 1998.
[27] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Editors).
“Advances in Knowledge Discovery and Data Mining”, Menlo Park, 1996.
[28] U. M. Fan-ad, G. Grinstein, "Information Visualization in Dara Mining
and Knowledge Discovery", Morgan Kaufman, San Francisco (CA), 2004.
[29] vincent wing-sing cho ,”knowledge discovery from distributed and
textual data” , Hong Kong University of Science and Technology , 1999.
[30] http://en.wikipedia.org/wiki/Association_rule_learning.