In recent years, data mining has become an active area of research because it can uncover previously unknown and interesting knowledge from very large database collections. Data mining is applied in a variety of domains, such as business and IT. A major problem that receives great attention from the community is the classification of data. The classification should be such that the results can be easily verified and easily interpreted by humans. In this paper we study various data mining techniques in order to find combinations that could enhance a hybrid technique, one that involves multiple techniques, and so improve the usability of the application. We study the CHARM algorithm, the CM-SPAM algorithm, the Apriori algorithm, the MOPNAR algorithm, and Top-K Rules.
A Survey on Frequent Patterns To Optimize Association Rules | IRJET Journal
This document discusses algorithms for mining association rules from transactional databases. It first provides background on association rule mining and frequent itemset mining. It then reviews the Apriori algorithm and FP-Growth algorithm, two classical algorithms for mining frequent itemsets. The document also surveys other association rule mining techniques proposed in literature. Finally, it proposes a genetic algorithm approach to optimize association rule mining by minimizing the number of rules generated.
Data Mining For Supermarket Sale Analysis Using Association Rule | ijtsrd
Data mining is a novel technology for discovering important information from data repositories, and it is widely used in almost all fields. Recently, mining of databases has become essential because of the growing amount of data and its wide applicability in retail industries for improving marketing strategies. Analysis of past transaction data can provide very valuable information on customer behavior and business decisions. The amount of data stored grows twice as fast as the speed of the fastest processor available to analyze it. The main purpose of association rule mining is to find association relationships among a large number of database items. It is used to describe the patterns of customers' purchases in the supermarket. This is presented in this paper. Rajeshri Shelke, "Data Mining For Supermarket Sale Analysis Using Association Rule", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-1 | Issue-4, June 2017, URL: http://www.ijtsrd.com/papers/ijtsrd94.pdf http://www.ijtsrd.com/engineering/computer-engineering/94/data-mining-for-supermarket-sale-analysis-using-association-rule/rajeshri-shelke
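As a rough illustration of what an "association relationship" among items means in market basket terms, the sketch below computes the two standard measures, support and confidence, over a handful of made-up baskets. The function names and the sample data are assumptions for illustration only and do not come from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / base


# Hypothetical supermarket baskets (illustrative data only).
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support({"bread", "milk"}, baskets))       # share of baskets with both items
print(confidence({"bread"}, {"milk"}, baskets))  # how often bread buyers also buy milk
```

A rule such as bread -> milk is usually reported only when both measures clear user-chosen minimum thresholds.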
In this paper, we present a literature survey of existing frequent item set mining algorithms. The concept of frequent item set mining is also discussed in brief. The working procedure of some modern frequent item set mining techniques is given. Also the merits and demerits of each method are described. It is found that the frequent item set mining is still a burning research topic.
Discovering Frequent Patterns with New Mining Procedure | IOSR Journals
This document provides a summary of existing algorithms for discovering frequent patterns in transactional datasets. It begins with an introduction to the problem of mining frequent itemsets and association rules. It then describes the Apriori algorithm, which is a seminal and classical level-wise algorithm for mining frequent itemsets. The document notes some limitations of Apriori when applied to large datasets, including increased computational cost due to many database scans and large candidate sets. It then briefly describes the FP-Growth algorithm as an alternative pattern growth approach. The remainder of the document focuses on improvements made to Apriori, including the Direct Hashing and Pruning (DHP) algorithm, which aims to reduce the candidate set size to improve efficiency.
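The level-wise behaviour described above can be sketched in a few lines. This is a minimal, unoptimized Apriori written for illustration, assuming an absolute `min_count` threshold and small in-memory data; it is not the implementation from the summarized document. Note the one full database scan per level, which is the cost the summary points to.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items with one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets whose union has exactly k items.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                candidates.add(union)

        # One more database scan to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


# Illustrative data only.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = apriori(db, min_count=3)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Techniques such as DHP aim to shrink the `candidates` set before the counting scan, since that inner loop dominates the running time.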
Data mining is important for any business. Most management-level decisions are based on data mining. One such aspect is the association between different sale products, i.e. the actual support of one product with respect to another. This concept is called association mining. Under this concept we define the process of estimating the sale of one product relative to another. We propose an association rule approach based on the concept of hardware support. In this approach we first maintain the database and compare it with a systolic array; after this, a pruning process is performed to filter the database and remove the rarely used items. Finally, the data is indexed using a hashing technique and the decision is made in terms of support count. Krishan Rohilla | Shabnam Kumari | Reema, "Data Mining based on Hashing Technique", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-1 | Issue-4, June 2017, URL: http://www.ijtsrd.com/papers/ijtsrd82.pdf http://www.ijtsrd.com/computer-science/data-miining/82/data-mining-based-on-hashing-technique/krishan-rohilla
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi... | IRJET Journal
This document discusses classifying patterns from online shopping data using data mining techniques. It proposes using the Apriori algorithm to mine frequent patterns from transaction data stored in a data warehouse. Patterns mined from the data warehouse using Apriori would then be stored in a pattern warehouse. This would allow users to view product details and related patterns when browsing items online. The system aims to efficiently analyze large amounts of user data to discover useful patterns for improving the online shopping experience.
- Data mining involves discovering novel patterns from large databases using algorithms and computers. It aims to find hidden patterns in datasets by analyzing attribute correlations.
- Common data mining tasks include classification, regression, clustering, association analysis, and anomaly detection. These can be used to solve problems like product recommendations, student enrollment predictions, and fraud detection.
- The key steps in data mining typically involve data preparation, exploration, model development, and result interpretation. Association rule mining is commonly used and aims to find relationships between variables in large datasets.
The document summarizes research on improving the Apriori algorithm for association rule mining. It first provides background on association rule mining and the standard Apriori algorithm. It then discusses several proposed improvements to Apriori, including reducing the number of database scans, shrinking the candidate itemset size, and using techniques like pruning and hash trees. Finally, it outlines some open challenges for further optimizing association rule mining.
This document summarizes research on improving the Apriori algorithm for mining association rules from transactional databases. It first provides background on association rule mining and describes the basic Apriori algorithm. The Apriori algorithm finds frequent itemsets by multiple passes over the database but has limitations of increased search space and computational costs as the database size increases. The document then reviews research on variations of the Apriori algorithm that aim to reduce the number of database scans, shrink the candidate sets, and facilitate support counting to improve performance.
Data mining plays an important role in extracting patterns and other information from data. The Apriori algorithm has been one of the most popular techniques for finding frequent patterns. However, the Apriori algorithm scans the database many times, leading to heavy I/O. This paper proposes an approach to overcome the limitations of the Apriori algorithm while improving the overall speed of execution for all variations in "minimum support". It aims to reduce the number of scans required to find frequent patterns.
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net... | ijsrd.com
In the development, standardization and implementation of LTE networks based on Orthogonal Frequency Division Multiple Access (OFDMA), simulations are necessary to test and optimize algorithms and procedures before real-time establishment. This can be done in both the Physical Layer (Link-Level) and Network (System-Level) contexts. This paper proposes using Network Simulator 3 (NS-3), which is capable of evaluating the performance of the Downlink Shared Channel of LTE networks, and compares it with the performance of an available MATLAB-based LTE System Level Simulator.
A Brief Overview On Frequent Pattern Mining Algorithms | Sara Alvarez
This document provides an overview of frequent pattern mining algorithms. It discusses that frequent pattern mining finds inherent regularities in data and plays an essential role in data mining tasks. The document then describes several sequential pattern mining algorithms such as AIS, SETM, Apriori and some of its variations that improve efficiency. It also discusses parallel pattern mining algorithms and some challenges in the field of frequent pattern mining.
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro... | BRNSSPublicationHubI
This document presents an improved Apriori algorithm for generating frequent item sets on large datasets using Hadoop MapReduce. The classical Apriori algorithm suffers from repeated database scans, high candidate generation costs, and memory issues. The proposed improved Apriori algorithm aims to address these issues by leveraging Hadoop MapReduce to parallelize the processing and reduce unnecessary database scans. It presents the pseudocode for the classical and improved algorithms. The improved algorithm is evaluated to show it provides better performance than the classical Apriori algorithm in terms of time and number of iterations required.
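To make the parallelization idea concrete, the sketch below simulates one counting pass in plain Python with a map step and a reduce step. It does not use the Hadoop API, it enumerates every k-item subset rather than a pruned candidate set, and it is not the improved algorithm from the document; the function names and the data splits are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(split, k):
    """Mapper: emit (itemset, 1) for every k-item subset of each transaction in a split."""
    for transaction in split:
        for subset in combinations(sorted(transaction), k):
            yield subset, 1


def reduce_phase(pairs, min_count):
    """Reducer: sum the emitted counts per itemset and keep those reaching min_count."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return {key: total for key, total in totals.items() if total >= min_count}


def frequent_k_itemsets(splits, k, min_count):
    """Run the mapper over each split independently, then reduce the combined output."""
    emitted = []
    for split in splits:  # on a cluster, each split would be handled by a separate mapper
        emitted.extend(map_phase(split, k))
    return reduce_phase(emitted, min_count)


# Two hypothetical input splits of one transaction database.
splits = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}],
]
pairs = frequent_k_itemsets(splits, k=2, min_count=3)
print(pairs)
```

Because each mapper sees only its own split, the counting work scales out with the number of splits, while the reducer applies the global support threshold.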
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING... | IAEME Publication
In this paper, an MDL-based reduction of frequent patterns is presented. The ideal outcome of any pattern mining process is to explore the data and gain new insights. We also need to eliminate the non-interesting patterns that describe noise. The major problem in frequent pattern mining is to identify the interesting patterns. Instead of performing association rule mining on all the frequent item sets, it is feasible to select a subset of frequent item sets and perform the mining task on it. Selecting a small set of frequent item sets from a large number of interesting ones is a difficult task. In our approach, an MDL-based algorithm is used to reduce the number of frequent item sets used for association rule mining.
Machine learning techniques can be used to detect outliers in trading data. The proposed system would use machine learning to train a model to identify outlier trades from input data and flag them for administrator approval. If approved, the trade would be submitted and the model retrained; if denied, the model would not be retrained. This approach allows the model to learn over time to better identify outlier trades. Support vector machines are one machine learning technique that could be used to classify trades as outliers or not based on training data. Identifying and addressing outliers can help reduce human errors and fraudulent trading activities.
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES | IJDKP
This document proposes the Binary Decision Tree (BDT) algorithm for efficiently mining association rules from incremental databases. The BDT algorithm scans the database only once to construct a dynamic growing binary tree, allowing it to handle changes to the database without reprocessing the entire data. It overcomes limitations of previous algorithms like Apriori and FP-Growth that require reprocessing when the database is updated. The BDT algorithm represents each transaction as a bit pattern and uses the tree structure to recursively traverse patterns to determine support counts for item sets without generating candidates. This makes it suitable for mining association rules from incremental databases.
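The bit-pattern encoding mentioned above can be illustrated on its own. The sketch below stores each transaction as an integer bitmask and answers support queries with a bitwise test, so newly arriving transactions are appended without reprocessing earlier ones. It deliberately leaves out the binary tree itself, and the class and method names are hypothetical rather than taken from the paper.

```python
class BitTransactionStore:
    """Keeps transactions as integer bitmasks, one bit position per distinct item."""

    def __init__(self):
        self.bit_of = {}  # item -> bit position, assigned on first sight
        self.masks = []   # one bitmask per stored transaction

    def _mask(self, items, create):
        mask = 0
        for item in items:
            if item not in self.bit_of:
                if not create:
                    return None  # an unseen item cannot occur in any stored transaction
                self.bit_of[item] = len(self.bit_of)
            mask |= 1 << self.bit_of[item]
        return mask

    def add(self, transaction):
        """Append one transaction; earlier masks stay valid when new items appear."""
        self.masks.append(self._mask(transaction, create=True))

    def support_count(self, itemset):
        """Count stored transactions whose mask covers every bit of the itemset."""
        target = self._mask(itemset, create=False)
        if target is None:
            return 0
        return sum(1 for m in self.masks if m & target == target)


store = BitTransactionStore()
store.add({"milk", "bread"})
store.add({"milk"})
store.add({"milk", "bread", "eggs"})  # arrives later, no rebuild needed
print(store.support_count({"milk", "bread"}))
```

The query here still walks every stored mask; the tree structure described in the document is what would avoid that linear pass.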
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES | IJDKP
This research paper proposes an algorithm to find association rules for incremental databases. Most transaction databases are dynamic. Consider, for example, the daily purchase transactions of supermarket customers. Customers' purchasing behaviour may change from day to day, and new products replace old products. In this scenario static data mining algorithms are of limited value. If an algorithm learns continuously from day to day, we can obtain the most up-to-date knowledge. This is very helpful in today's fast-changing world. Well-known benchmark algorithms for association rule mining are Apriori and FP-Growth. However, the major drawback of Apriori and FP-Growth is that they must be rebuilt all over again once the original database is changed. Therefore, in this paper we introduce an efficient algorithm called Binary Decision Tree (BDT) to process incremental data. Processing data continuously requires substantial processing and storage resources. In this algorithm we scan the database only once, constructing a dynamically growing binary tree to find association rules with better performance and optimum storage. The algorithm can also be applied to static data, but our main intention is to give an optimum solution for incremental data.
A comprehensive study of major techniques of multi level frequent pattern min... | eSAT Journals
Abstract: Frequent pattern mining has become one of the most popular data mining approaches for the analysis of purchasing patterns. Techniques such as Apriori and FP-Growth were typically restricted to a single concept level. We extend our research to study multi-level frequent patterns in multi-level environments. Mining multi-level frequent patterns may lead to the discovery of patterns at different levels of a hierarchy. In this study, we describe the main techniques used to solve these problems and give a comprehensive survey of the most influential algorithms that were proposed during the last decade.
Index Terms: Data Mining, Data Transformation, Frequent Pattern Mining (FPM), Transactional Database.
A comprehensive study of major techniques of multi level frequent pattern min... | eSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
International journal of computer science and innovation vol 2015-n1-paper4 | sophiabelthome
This paper discusses applications of association rule mining (ARM). Section 1 provides an overview of ARM and describes it as an important data mining technique. Section 2 reviews literature on ARM and discusses various ARM algorithms. Section 3 describes several applications of ARM in detail, including market basket analysis, building intelligent transportation systems, medical diagnosis, and web log analysis. ARM is useful for discovering frequent patterns and relationships in large datasets across many domains.
Output Privacy Protection With Pattern-Based Heuristic Algorithm | ijcsit
Privacy Preserving Data Mining (PPDM) is an ongoing research area aimed at bridging the gap between collaborative data mining and data confidentiality. Many different approaches have been adopted for PPDM; of these, the rule hiding approach is used in this article. This approach ensures output privacy, which protects the mined patterns (itemsets) from malicious inference. An efficient algorithm named the Pattern-based Maxcover Algorithm is proposed with experimental results. This algorithm minimizes the dissimilarity between the source and the released database; moreover, the protected patterns cannot be retrieved from the released database by an adversary or counterpart even with an arbitrarily low support threshold.
An improved apriori algorithm for association rules | ijnlc
There are several algorithms for mining association rules. One of the most popular is Apriori, which is used to extract frequent itemsets from a large database and derive association rules for knowledge discovery. This paper identifies a limitation of the original Apriori algorithm, namely the time wasted scanning the whole database in search of frequent itemsets, and presents an improvement that reduces this wasted time by scanning only some transactions. Experimental results with several groups of transactions and several values of minimum support, applied to both the original Apriori and our implemented improved Apriori, show that the improved Apriori reduces the time consumed by 67.38% in comparison with the original Apriori, making the algorithm more efficient and less time-consuming.
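One plausible way to count a candidate while "scanning only some transactions" is to keep, for each item, the set of transaction IDs in which it occurs, and then visit only the transactions of the candidate's rarest item. The abstract does not spell out its mechanism, so the sketch below is an assumption for illustration and its function names are hypothetical.

```python
def build_tid_lists(transactions):
    """Map each item to the set of transaction IDs (list positions) containing it."""
    tids = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tids.setdefault(item, set()).add(tid)
    return tids


def count_with_tid_lists(candidate, transactions, tids):
    """Return (support count, number of transactions actually scanned)."""
    # The candidate can only occur where its rarest item occurs.
    rarest = min(candidate, key=lambda item: len(tids.get(item, ())))
    to_scan = tids.get(rarest, set())
    needed = set(candidate)
    count = sum(1 for tid in to_scan if needed <= transactions[tid])
    return count, len(to_scan)


# Illustrative data only.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"a"}, {"b"}]
tids = build_tid_lists(transactions)
print(count_with_tid_lists({"a", "c"}, transactions, tids))
```

In this toy example the candidate {a, c} is counted by visiting two of the five transactions instead of all five; any real saving depends on how skewed the item frequencies are.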
This document proposes improvements to existing algorithms for multidimensional sequential pattern mining. It summarizes existing research that combines sequential pattern mining with multidimensional analysis or incorporates multidimensional information into sequential pattern mining. The proposed algorithm first mines minimal atomic frequent sequences from the data and prunes hierarchies using an adapted PrefixSpan algorithm to efficiently generate sequential patterns associated with multidimensional information. This approach aims to improve over existing methods by leveraging the efficiency of PrefixSpan for multidimensional sequential pattern mining.
Weighted frequent pattern mining is suggested as a way to find more important frequent patterns by considering a different weight for each item. Weighted frequent patterns are generated in weight-ascending and frequency-descending order using a prefix tree structure. These generated weighted frequent patterns are then passed to a maximal frequent itemset mining algorithm. Maximal frequent pattern mining can reduce the number of frequent patterns while keeping sufficient result information. In this paper, we propose an efficient algorithm for mining maximal weighted frequent patterns over data streams. A new efficient data structure, i.e. a prefix tree and conditional tree structure, is used to dynamically maintain the transaction information. Three information mining strategies (incremental, interactive and maximal) are presented, and the details of the algorithms are discussed. Our study applies the approach to electronic shop market basket analysis. Experimental studies are performed to evaluate the effectiveness of our algorithm.
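Definitions of weighted support differ between papers. The sketch below assumes one common choice, the average weight of the items in a pattern multiplied by the pattern's relative frequency, which may not be the measure this paper uses; the item weights and transactions are made up for illustration.

```python
def weighted_support(itemset, transactions, weights):
    """Average item weight of the pattern times its relative frequency (assumed definition)."""
    itemset = frozenset(itemset)
    avg_weight = sum(weights[item] for item in itemset) / len(itemset)
    frequency = sum(1 for t in transactions if itemset <= t) / len(transactions)
    return avg_weight * frequency


# Hypothetical electronics-shop baskets and item weights (e.g. normalized profit).
transactions = [{"tv", "cable"}, {"tv"}, {"cable"}, {"tv", "cable"}]
weights = {"tv": 0.9, "cable": 0.1}

print(weighted_support({"tv"}, transactions, weights))
print(weighted_support({"cable"}, transactions, weights))
```

Both single items occur in three of the four transactions, yet the heavier item scores far higher, which is the effect weighting is meant to capture.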
Association rules are among the main techniques for determining frequent item sets in data mining. The Apriori algorithm is the classic association rule algorithm, which enumerates all of the frequent item sets. If the database is large, it takes too much time to scan the database. The improved algorithm is verified; the results show that it is reasonable and effective, and can extract more valuable information.
The document summarizes research on improving the Apriori algorithm for association rule mining. It first provides background on association rule mining and the standard Apriori algorithm. It then discusses several proposed improvements to Apriori, including reducing the number of database scans, shrinking the candidate itemset size, and using techniques like pruning and hash trees. Finally, it outlines some open challenges for further optimizing association rule mining.
This document summarizes research on improving the Apriori algorithm for mining association rules from transactional databases. It first provides background on association rule mining and describes the basic Apriori algorithm. The Apriori algorithm finds frequent itemsets by multiple passes over the database but has limitations of increased search space and computational costs as the database size increases. The document then reviews research on variations of the Apriori algorithm that aim to reduce the number of database scans, shrink the candidate sets, and facilitate support counting to improve performance.
Data Mining plays an important role in extracting patterns and other information from data. The Apriori Algorithm has been the most popular techniques infinding frequent patterns. However, Apriori Algorithm scans the database many times leading to large I/O. This paper is proposed to overcome the limitaions of Apriori Algorithm while improving the overall speed of execution for all variations in โminimum supportโ. It is aimed to reduce the number of scans required to find frequent patters.
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...ijsrd.com
ย
In the development, standardization and implementation of LTE Networks based on Orthogonal Freq. Division Multiple Access (OFDMA), simulations are necessary to test as well as optimize algorithms and procedures before real time establishment. This can be done by both Physical Layer (Link-Level) and Network (System-Level) context. This paper proposes Network Simulator 3 (NS-3) which is capable of evaluating the performance of the Downlink Shared Channel of LTE networks and comparing it with available MatLab based LTE System Level Simulator performance.
A Brief Overview On Frequent Pattern Mining AlgorithmsSara Alvarez
ย
This document provides an overview of frequent pattern mining algorithms. It discusses that frequent pattern mining finds inherent regularities in data and plays an essential role in data mining tasks. The document then describes several sequential pattern mining algorithms such as AIS, SETM, Apriori and some of its variations that improve efficiency. It also discusses parallel pattern mining algorithms and some challenges in the field of frequent pattern mining.
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...BRNSSPublicationHubI
ย
This document presents an improved Apriori algorithm for generating frequent item sets on large datasets using Hadoop MapReduce. The classical Apriori algorithm suffers from repeated database scans, high candidate generation costs, and memory issues. The proposed improved Apriori algorithm aims to address these issues by leveraging Hadoop MapReduce to parallelize the processing and reduce unnecessary database scans. It presents the pseudocode for the classical and improved algorithms. The improved algorithm is evaluated to show it provides better performance than the classical Apriori algorithm in terms of time and number of iterations required.
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING...IAEME Publication
ย
In this paper, MDL based reduction in frequent pattern is presented. The ideal outcome of any pattern mining process is to explore the data in new insights. And also, we need to eliminate the non-interesting patterns that describe noise. The major problem in frequent pattern mining is to identify the interesting patterns. Instead of performing association rule mining on all the frequent item sets, it is feasible to select a sub set of frequent item sets and perform the mining task. Selecting a small set of frequent item sets from large amount of interesting ones is a difficult task. In our approach, MDL based algorithm is used for reducing the number of frequent item sets to be used for association rule mining is presented.
Machine learning techniques can be used to detect outliers in trading data. The proposed system would use machine learning to train a model to identify outlier trades from input data and flag them for administrator approval. If approved, the trade would be submitted and the model retrained; if denied, the model would not be retrained. This approach allows the model to learn over time to better identify outlier trades. Support vector machines are one machine learning technique that could be used to classify trades as outliers or not based on training data. Identifying and addressing outliers can help reduce human errors and fraudulent trading activities.
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASESIJDKP
ย
This document proposes the Binary Decision Tree (BDT) algorithm for efficiently mining association rules from incremental databases. The BDT algorithm scans the database only once to construct a dynamic growing binary tree, allowing it to handle changes to the database without reprocessing the entire data. It overcomes limitations of previous algorithms like Apriori and FP-Growth that require reprocessing when the database is updated. The BDT algorithm represents each transaction as a bit pattern and uses the tree structure to recursively traverse patterns to determine support counts for item sets without generating candidates. This makes it suitable for mining association rules from incremental databases.
BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES IJDKP
This research paper proposes an algorithm to find association rules for incremental databases. Most
transaction databases are dynamic. Consider, for example, the daily purchase transactions of supermarket
customers. Customers' purchasing behaviour may change from day to day, and new products replace
old ones. In this scenario, static data mining algorithms are of limited use. If an algorithm
learns continuously from day to day, it can provide the most up-to-date knowledge, which is very helpful in
today's fast-changing world. The well-known benchmark algorithms for association rule mining are
Apriori and FP-Growth. However, their major drawback is that they must be
rerun from scratch once the original database changes. Therefore, in this paper we introduce an
efficient algorithm called Binary Decision Tree (BDT) to process incremental data. Processing
data continuously requires substantial processing and storage resources. In this algorithm we scan the
database only once and construct a dynamically growing binary tree to find association rules with better
performance and optimum storage. The algorithm can also be applied to static data, but our main intention is to give
an optimum solution for incremental data.
International Journal of Computational Engineering Research (IJCER) is dedicated to protecting personal information and will make every reasonable effort to handle collected information appropriately. All information collected, as well as related requests, will be handled as carefully and efficiently as possible in accordance with IJCER standards for integrity and objectivity.
A comprehensive study of major techniques of multi level frequent pattern min... eSAT Journals
Abstract Frequent pattern mining has become one of the most popular data mining approaches for the analysis of purchasing patterns. Techniques such as Apriori and FP-Growth were typically restricted to a single concept level. We extend our research to study multi-level frequent patterns in multi-level environments. Mining multi-level frequent patterns may lead to the discovery of patterns at different levels of the hierarchy. In this study, we describe the main techniques used to solve these problems and give a comprehensive survey of the most influential algorithms that were proposed during the last decade.
Index Terms: Data Mining, Data Transformation, Frequent Pattern Mining (FPM), Transactional Database.
A comprehensive study of major techniques of multi level frequent pattern min... eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
International journal of computer science and innovation vol 2015-n1-paper4 sophiabelthome
This paper discusses applications of association rule mining (ARM). Section 1 provides an overview of ARM and describes it as an important data mining technique. Section 2 reviews literature on ARM and discusses various ARM algorithms. Section 3 describes several applications of ARM in detail, including market basket analysis, building intelligent transportation systems, medical diagnosis, and web log analysis. ARM is useful for discovering frequent patterns and relationships in large datasets across many domains.
Output Privacy Protection With Pattern-Based Heuristic Algorithm ijcsit
Privacy Preserving Data Mining (PPDM) is an ongoing research area aimed at bridging the gap between
collaborative data mining and data confidentiality. Many different approaches have
been adopted for PPDM; of these, the rule hiding approach is used in this article. This approach ensures
output privacy, which protects the mined patterns (itemsets) from malicious inference. An efficient
algorithm named the Pattern-based Maxcover Algorithm is proposed, with experimental results. This
algorithm minimizes the dissimilarity between the source and the released database. Moreover, the
protected patterns cannot be retrieved from the released database by an adversary or counterpart, even
with an arbitrarily low support threshold.
An improved apriori algorithm for association rules ijnlc
There are several algorithms for mining association rules. One of the most popular is Apriori,
which is used to extract frequent itemsets from a large database and to derive association rules for
knowledge discovery. This paper points out a limitation of the original
Apriori algorithm, namely the time wasted scanning the whole database in search of frequent itemsets, and
presents an improvement on Apriori that reduces that wasted time by scanning only some
transactions. Experimental results with several groups of transactions and
several values of minimum support, applied to both the original Apriori and our improved
implementation, show that the improved Apriori reduces the time consumed by 67.38% compared with the original
Apriori, making the algorithm more efficient and less time consuming.
This document proposes improvements to existing algorithms for multidimensional sequential pattern mining. It summarizes existing research that combines sequential pattern mining with multidimensional analysis or incorporates multidimensional information into sequential pattern mining. The proposed algorithm first mines minimal atomic frequent sequences from the data and prunes hierarchies using an adapted PrefixSpan algorithm to efficiently generate sequential patterns associated with multidimensional information. This approach aims to improve over existing methods by leveraging the efficiency of PrefixSpan for multidimensional sequential pattern mining.
Weighted frequent pattern mining has been proposed to find the more important frequent patterns by considering a different weight for each item. Weighted frequent patterns are generated in weight-ascending and frequency-descending order using a prefix tree structure. The generated weighted frequent patterns are then passed to a maximal frequent itemset mining algorithm. Maximal frequent pattern mining can reduce the number of frequent patterns while keeping sufficient result information. In this paper, we propose an efficient algorithm for mining maximal weighted frequent patterns over data streams. A new efficient data structure, combining a prefix tree and a conditional tree, is used to dynamically maintain the transaction information. Three mining strategies (incremental, interactive and maximal) are presented, and the details of the algorithms are discussed. The algorithm is applied to market basket analysis for an electronics shop. Experimental studies are performed to evaluate the effectiveness of our algorithm.
Association rules are the main technique for
determining frequent itemsets in data mining. The Apriori
algorithm is the classic association rule algorithm, which
enumerates all of the frequent itemsets. If the database is large, it
takes too much time to scan the database. The improved
algorithm is verified; the results show that the improved
algorithm is reasonable and effective, and can extract more
valuable information.
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf Dr. Radhey Shyam
The document provides an overview of data analytics and big data concepts. It discusses the characteristics of big data, including the four V's of volume, velocity, variety and veracity. It describes different types of data like structured, semi-structured and unstructured data. The document also introduces popular big data platforms like Hadoop, Spark and Cassandra. Finally, it outlines key reasons for the need of data analytics, such as enabling better decision making and improving organizational efficiency.
SE-UNIT-3-II-Software metrics, numerical and their solutions.pdf Dr. Radhey Shyam
1) The document discusses software metrics and Halstead's software science measures including program length, vocabulary, volume, difficulty, and effort. It provides the formulas to calculate these measures and an example calculation for a C function.
2) Guidelines are provided for identifying operands and operators when applying Halstead's measures to source code. Various program elements like variables, functions, and statements are classified.
3) References on software engineering and metrics are listed at the end.
Introduction to Data Analytics and data analytics life cycle Dr. Radhey Shyam
The document provides an overview of data analytics and big data concepts. It discusses the characteristics of big data, including the four V's of volume, velocity, variety and veracity. It also describes different types of data like structured, semi-structured and unstructured data. The document then introduces big data platforms and tools like Hadoop, Spark and Cassandra. Finally, it discusses the need for data analytics in business, including enabling better decision making and improving efficiency.
This document provides an overview of database normalization concepts. It begins by defining normalization as the process of organizing data in a database to eliminate redundant data and ensure data dependencies are properly represented by constraints. It then discusses first normal form (1NF), which requires each cell to contain a single value. Candidate keys and super keys are also defined. The document concludes by briefly mentioning higher normal forms up to fifth normal form (5NF) and some alternative database design approaches such as NoSQL and graph databases.
The document provides information about the syllabus for the Data Analytics (KIT-601) course. It includes 5 units that will be covered: Introduction to Data Analytics, Data Analysis techniques including regression modeling and multivariate analysis, Mining Data Streams, Frequent Itemsets and Clustering, and Frameworks and Visualization. It lists the course outcomes and Bloom's taxonomy levels. It also provides details on the topics to be covered in each unit, including proposed lecture hours, textbooks, and an evaluation scheme. The syllabus aims to discuss concepts of data analytics and apply techniques such as classification, regression, clustering, and frequent pattern mining on data.
This document provides an introduction to the concepts of data analytics and the data analytics lifecycle. It discusses big data in terms of the 4Vs - volume, velocity, variety and veracity. It also discusses other characteristics of big data like volatility, validity, variability and value. The document then discusses various concepts in data analytics like traditional business intelligence, data mining, statistical applications, predictive analysis, and data modeling. It explains how these concepts are used to analyze large datasets and derive value from big data. The goal of data analytics is to gain insights and a competitive advantage through analyzing large and diverse datasets.
This document is a slide presentation by Dr. Radhey Shyam on the topics of reinforcement learning and genetic algorithms. It discusses various types of applications that genetic algorithms can be used for, including control systems, design optimization, scheduling, robotics, machine learning, signal processing, game playing, and solving combinatorial optimization problems. Examples provided include gas pipeline control, missile evasion, aircraft design, manufacturing scheduling, neural network design, filter design, and solving the traveling salesman problem.
This document provides an overview of self-organizing maps (SOMs), a type of artificial neural network. It discusses the biological motivation for SOMs, which are inspired by self-organizing systems in the brain. The document outlines the basic architecture and learning algorithm of SOMs, including initialization, training procedures, and classification. It also reviews various properties of SOMs, such as their ability to approximate input spaces and perform topological ordering and density matching. Finally, applications of SOMs are briefly mentioned, such as for speech recognition, image analysis, and data visualization.
The document describes Convolutional Neural Networks (CNNs). It explains that CNNs are a type of neural network that uses convolutional layers, which apply filters to input data to extract features. This helps reduce the number of parameters needed compared to fully connected networks. The document provides examples of how CNNs can be used for image recognition, speech recognition, and text classification by applying filters that move across spatial or temporal dimensions of the input data.
1) The document discusses software metrics and Halstead's software science measures including program length, vocabulary, volume, difficulty, and effort. It provides the formulas to calculate these measures and an example calculation for a C function.
2) Guidelines are provided for identifying operands and operators when applying Halstead's measures to source code. Various program elements like variables, functions, and statements are classified.
3) The document also discusses other software metrics like lines of code (LOC) and function points which can be used to measure size and complexity. It provides a sample calculation of LOC and function points for a simple program.
The document provides an overview of Software Requirement Specification (SRS) and Software Quality Assurance (SQA). It discusses the importance of well-written requirements documents, as without them developers do not know what to build and customers do not know what to expect. The document also outlines different types of requirements like functional, non-functional, user and system requirements. It describes various requirements elicitation techniques like interviews, brainstorming sessions, use case approach etc. Finally, it discusses modeling requirements using tools like data flow diagrams, data dictionaries and entity relationship diagrams.
This document provides a 3 paragraph summary of a software engineering course titled "Software Engineering (KCS-601)" taught by Dr. Radhey Shyam at SRMCEM Lucknow. The course contents were compiled by Dr. Shyam and are available for students' academic use. Students can contact Dr. Shyam via email for any queries regarding the course material.
This document provides an overview of the unit 3 course material for Software Design taught by Dr. Radhey Shyam at SRMCEM Lucknow. The document discusses key concepts in software design including the importance of design, characteristics of good and bad design, coupling and cohesion, modularization, design models, high level design and architectural design. Specific topics covered include software design documentation, conceptual vs technical design, types of coupling and cohesion, advantages of modular systems, design frameworks, and strategies for design such as top-down, bottom-up, and hybrid approaches.
This document discusses image representation and description techniques. It begins by explaining that image segmentation results in a set of regions that need to be represented, often by their boundaries or internal characteristics, and described using features. Several boundary and regional representation and description methods are then outlined, including chain codes, shape numbers, Fourier descriptors, statistical moments, topology, and textures.
This document discusses image segmentation using morphological watersheds. It begins by explaining the concepts of regional minima, catchment basins, and watershed lines in a topographic representation of an image. It then describes the watershed algorithm which involves flooding the image from regional minima and building dams when flood waters would merge. The resulting dams represent the watershed lines and segmented boundaries. The document provides examples to illustrate the flooding process and discusses how markers can be used to limit oversegmentation from noise.
This document discusses image restoration and contains summaries of several lecture slides on image degradation and restoration models, noise models, and frequency domain filtering techniques for periodic noise reduction. It was compiled by Dr. Radhey Shyam with contributions from Dr. Philippe Cattin, and is intended for academic use by students to help explain basic concepts of image restoration.
The document is a unit on image enhancement from an image processing course. It was written by Dr. Radhey Shyam of the computer science department at BIET Lucknow, India. The unit introduces basic concepts of image enhancement in the spatial and frequency domains. Students will learn about arithmetic and logical operations on pixels to enhance images.
Advanced control scheme of doubly fed induction generator for wind turbine us... IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Discover the latest insights on Data Driven Maintenance with our comprehensive webinar presentation. Learn about traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. Explore real-world examples, industry best practices, and innovative solutions like FMECA and the D3M model. This presentation, led by expert Jules Oudmans, is essential for asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance. Download now to stay ahead in the evolving maintenance landscape.
Embedded machine learning-based road conditions and driving behavior monitoring IJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Introduction - e-waste - definition - sources of e-waste - hazardous substances in e-waste - effects of e-waste on environment and human health - need for e-waste management - e-waste handling rules - waste minimization techniques for managing e-waste - recycling of e-waste - disposal treatment methods of e-waste - mechanism of extraction of precious metal from leaching solution - global scenario of e-waste - e-waste in India - case studies.
CHINA'S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT jpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon
reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been
referred to as the "New Great Game." This research centres on the power struggle, considering
geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil
politics, and conventional and nontraditional security are all explored and explained by the researcher.
Using Mackinder's Heartland, Spykman's Rimland, and Hegemonic Stability theories, the study examines China's role
in Central Asia. It adheres to the empirical epistemological method and takes care to preserve
objectivity. It critically analyzes primary and secondary research documents to elaborate the role of
China's geo-economic outreach in Central Asian countries and its future prospects. According to this study,
China is seeing significant success in commerce, pipeline politics, and gaining influence over other
governments. This success may be attributed to the effective utilisation of key tools such as the Shanghai
Cooperation Organisation and the Belt and Road Economic Initiative.
artificial intelligence and data science contents.pptx GauravCar
What is artificial intelligence? Artificial intelligence is the ability of a computer or computer-controlled robot to perform tasks that are commonly associated with the intellectual processes characteristic of humans, such as the ability to reason.
KIT-601 Lecture Notes-UNIT-4.pdf Frequent Itemsets and Clustering
1. April 12, 2024 / Dr. RS
Data Analytics (KIT-601)
Unit-4: Frequent Itemsets and Clustering
Dr. Radhey Shyam
Professor
Department of Information Technology
SRMCEM Lucknow
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)
Unit-4 has been prepared and compiled by Dr. Radhey Shyam, with grateful acknowledgment to those who
made their course contents freely available or contributed directly or indirectly. Feel free to use this
study material for your own academic purposes. For any query, contact the author at this
email: shyam0058@gmail.com.
April 12, 2024
2. Data Analytics (KIT 601)
Course Outcome (CO) Bloom's Knowledge Level (KL)
At the end of the course, the student will be able to
CO 1 Discuss various concepts of data analytics pipeline K1, K2
CO 2 Apply classification and regression techniques K3
CO 3 Explain and apply mining techniques on streaming data K2, K3
CO 4 Compare different clustering and frequent pattern mining algorithms K4
CO 5 Describe the concept of R programming and implement analytics on Big data using R. K2,K3
DETAILED SYLLABUS 3-0-0
Unit Topic Proposed
Lecture
I
Introduction to Data Analytics: Sources and nature of data, classification of data
(structured, semi-structured, unstructured), characteristics of data, introduction to Big Data
platform, need of data analytics, evolution of analytic scalability, analytic process and
tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases
of data analytics lifecycle: discovery, data preparation, model planning, model building,
communicating results, operationalization.
08
II
Data Analysis: Regression modeling, multivariate analysis, Bayesian modeling, inference
and Bayesian networks, support vector and kernel methods, analysis of time series: linear
systems analysis & nonlinear dynamics, rule induction, neural networks: learning and
generalisation, competitive learning, principal component analysis and neural networks,
fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search
methods.
08
III
Mining Data Streams: Introduction to streams concepts, stream data model and
architecture, stream computing, sampling data in a stream, filtering streams, counting
distinct elements in a stream, estimating moments, counting ones in a window,
decaying window, Real-time Analytics Platform (RTAP) applications, Case studies: real
time sentiment analysis, stock market predictions.
08
IV
Frequent Itemsets and Clustering: Mining frequent itemsets, market based modelling,
Apriori algorithm, handling large data sets in main memory, limited pass algorithm,
counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means,
clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering
methods, clustering in non-euclidean space, clustering for streams and parallelism.
08
V
Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR,
Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems, Visualization: visual
data analysis techniques, interaction techniques, systems and applications.
Introduction to R - R graphical user interfaces, data import and export, attribute and data
types, descriptive statistics, exploratory data analysis, visualization before analysis,
analytics for unstructured data.
08
Text books and References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett, Data Analytics for IT Networks: Developing Innovative Use Cases, Pearson Education
Curriculum & Evaluation Scheme IT & CSI (V & VI semester) 23
Unit-IV: Frequent Itemsets and
Clustering
1 Mining Frequent Itemsets
Frequent itemset mining is a popular data mining task that involves identifying sets of items that frequently
co-occur in a given dataset. In other words, it involves finding the items that occur together frequently and
then grouping them into sets of items. One way to approach this problem is by using the Apriori algorithm,
which is one of the most widely used algorithms for frequent itemset mining.
The Apriori algorithm works by iteratively generating candidate itemsets and then checking their frequency
against a minimum support threshold. The algorithm starts by generating all possible itemsets of
size 1 and counting their frequencies in the dataset. The itemsets that meet the minimum support threshold
are then selected as frequent itemsets. The algorithm then proceeds to generate candidate itemsets of size
2 from the frequent itemsets of size 1 and counts their frequencies. This process is repeated until no more
frequent itemsets can be generated.
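The level-wise loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a tuned implementation: the function name `apriori`, the absolute `min_support` count, and the toy `baskets` data are all hypothetical, and candidates are formed by a plain join of the previous level with no subset pruning.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data: count the transactions containing each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Size 1: every individual item is a candidate.
    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    frequent = dict(level)

    k = 2
    while level:
        # Size k: join pairs of frequent (k-1)-itemsets whose union has k items.
        candidates = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k}
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy data: four market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
result = apriori(baskets, min_support=2)
```

With a threshold of two transactions, every single item and every pair is frequent in this toy data, while the only three-item candidate occurs once and is dropped, so the loop stops at size 3.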
However, when dealing with large datasets, this approach can become computationally expensive due
to the potentially large number of candidate itemsets that need to be generated and counted. Point-wise
frequent itemset mining is a more efficient alternative that can reduce the computational complexity of the
Apriori algorithm by exploiting the sparsity of the dataset.
Point-wise frequent itemset mining works by iterating over the transactions in the dataset and identifying
the itemsets that occur in each transaction. For each transaction, the algorithm generates a bitmap vector
where each bit corresponds to an item in the dataset, and its value is set to 1 if the item occurs in the
transaction and 0 otherwise. The algorithm then performs a bitwise AND operation between the bitmap
vectors of each transaction to identify the itemsets that occur in all the transactions. The itemsets that meet
the minimum support threshold are then selected as frequent itemsets.
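One possible reading of the bitmap representation above is sketched below. It assumes each transaction is packed into a Python integer with one bit per item; the helper names `encode` and `support_count`, the item ordering, and the toy data are hypothetical. A transaction contains an itemset when the bitwise AND of its bitmap with the itemset's mask equals the mask, and ANDing all transaction bitmaps together leaves only the items common to every transaction.

```python
from functools import reduce
from operator import and_

def encode(items, item_index):
    """Pack a collection of items into an integer with one bit per item."""
    bits = 0
    for item in items:
        bits |= 1 << item_index[item]
    return bits

def support_count(itemset, bitmaps, item_index):
    """Count the transactions whose bitmap contains every bit of the itemset."""
    mask = encode(itemset, item_index)
    return sum(1 for b in bitmaps if (b & mask) == mask)

# Hypothetical toy data: four market baskets over three items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
item_index = {"bread": 0, "butter": 1, "milk": 2}
bitmaps = [encode(b, item_index) for b in baskets]

pair_support = support_count({"bread", "milk"}, bitmaps, item_index)
# AND over all bitmaps: a bit survives only if the item is in every transaction.
common_to_all = reduce(and_, bitmaps)
```

Here `pair_support` is 2, and `common_to_all` is 0 because no single item appears in all four baskets.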
The advantage of point-wise frequent itemset mining is that it avoids generating candidate itemsets that
are not present in the dataset, thereby reducing the number of itemsets that need to be generated and
counted. Additionally, point-wise frequent itemset mining can be parallelized, making it suitable for mining
large datasets on distributed systems.
In summary, point-wise frequent itemset mining is an efficient alternative to the Apriori algorithm for
frequent itemset mining. It works by iterating over the transactions in the dataset and identifying the
itemsets that occur in each transaction, thereby avoiding the generation of candidate itemsets that are not
present in the dataset.
2 Market Based Modelling
Market-based modeling is a technique used in economics and business to analyze and simulate the behavior of
markets, particularly in relation to the supply and demand of goods and services. This modeling technique
involves creating mathematical models that can simulate how different market participants (consumers,
producers, and intermediaries) interact with each other in a market setting.
One of the most common market-based models is the supply and demand model, which assumes that the
price of a good or service is determined by the balance between its supply and demand. In this model, the
price of a good or service will rise if the demand for it exceeds its supply, and will fall if the supply exceeds
the demand.
Another popular market-based model is the game theory model, which is used to analyze how different
participants in a market interact with each other. Game theory models assume that market participants are
rational and act in their own self-interest, and seek to identify the strategies that each participant is likely
to adopt in a given situation.
Market-based models can be used to analyze a wide range of economic phenomena, from the pricing
of individual goods and services to the behavior of entire industries and markets. They can also be used
to test the potential impact of various policies and interventions on the behavior of markets and market
participants.
Overall, market-based modeling is a powerful tool for understanding and predicting the behavior of
markets and the economy as a whole. By creating mathematical models that simulate the behavior of
market participants and the interactions between them, economists and business analysts can gain valuable
insights into the workings of markets, and develop strategies for managing and optimizing their performance.
3 Apriori Algorithm
The Apriori algorithm is a popular algorithm used in data mining and machine learning to discover frequent
itemsets in large transactional datasets. It was proposed by Agrawal and Srikant in 1994 and is widely used
in association rule mining, market basket analysis, and other data mining applications.
The Apriori algorithm uses a bottom-up approach to generate all frequent itemsets by first identifying
frequent individual items and then using those items to generate larger itemsets. The algorithm works by
performing the following steps:
• First, the algorithm scans the entire dataset to identify all individual items and their frequency of
occurrence. This information is used to generate the initial set of frequent itemsets.
• Next, the algorithm uses a level-wise search strategy to generate larger itemsets by combining frequent
itemsets from the previous level. The algorithm starts with two-itemsets and then progressively
generates larger itemsets until no more frequent itemsets can be found.
• At each level, the algorithm prunes the search space by eliminating itemsets that cannot be frequent
based on the minimum support threshold. This is done using the Apriori principle, which states that
any subset of a frequent itemset must also be frequent.
The algorithm terminates when no more frequent itemsets can be generated or when the maximum
itemset size is reached.
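The level-wise search described above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `apriori`, the absolute threshold `min_count`, and the toy `baskets` data are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent individual items.
    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(items).items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy data, not taken from the text.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
result = apriori(baskets, min_count=2)
```

Each iteration of the `while` loop corresponds to one level of the search and costs one scan of the data, which is why the number of passes grows with the size of the largest frequent itemset.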
Once all frequent itemsets have been identified, the Apriori algorithm can be used to generate association
rules that describe the relationships between different items in the dataset. An association rule is a statement
of the form X → Y, where X and Y are disjoint itemsets. The rule indicates that transactions containing
the items in X tend to also contain the items in Y.
The strength of an association rule is measured using two metrics: support and confidence. Support is
the percentage of transactions in the dataset that contain both X and Y, while confidence is the percentage
of the transactions containing X that also contain Y.
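As a small illustration of the two metrics, the sketch below computes them directly from a list of transactions. The function names and the toy `baskets` data are assumptions for the example; `confidence` assumes that X occurs in at least one transaction.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, x, y):
    """support(X and Y together) / support(X): how often Y appears when X does."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical toy data, not taken from the text.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
s = support(baskets, {"bread", "milk"})        # 2 of the 4 transactions
c = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 transactions with bread
```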
Overall, the Apriori algorithm is a powerful tool for discovering frequent itemsets and association rules
in large datasets. By identifying patterns and relationships between different items in the dataset, it can
be used to gain valuable insights into consumer behavior, market trends, and other important business and
economic phenomena.
4 Handling Large Datasets in Main Memory
Handling large datasets in main memory can be a challenging task, as the amount of memory available on
most computer systems is often limited. However, there are several techniques and strategies that can be
used to effectively manage and analyze large datasets in main memory:
• Use data compression: Data compression techniques can be used to reduce the amount of memory
required to store a dataset. Techniques such as gzip or bzip2 can compress text data, while binary
data can be compressed using libraries like LZ4 or Snappy.
• Use data partitioning: Large datasets can be partitioned into smaller, more manageable subsets,
which can be processed and analyzed in main memory. This can be done using techniques such as
horizontal partitioning, vertical partitioning, or hybrid partitioning.
• Use data sampling: Data sampling can be used to select a representative subset of data for analysis,
without requiring the entire dataset to be loaded into memory. Random sampling, stratified sampling,
and cluster sampling are some of the commonly used sampling techniques.
• Use in-memory databases: In-memory databases can be used to store large datasets in main
memory for faster querying and analysis. Examples of in-memory databases include Apache Ignite,
SAP HANA, and VoltDB.
• Use parallel processing: Parallel processing techniques can be used to distribute the processing of
large datasets across multiple processors or cores. This can be done using libraries like Apache Spark,
which provides distributed data processing capabilities.
• Use data streaming: Data streaming techniques can be used to process large datasets in real-time
by processing data as it is generated, rather than storing it in memory. Apache Kafka, Apache Flink,
and Apache Storm are some of the popular data streaming platforms.
Overall, effective management of large datasets in main memory requires a combination of data compression,
partitioning, sampling, in-memory databases, parallel processing, and data streaming techniques. By
leveraging these techniques, large datasets can often be analyzed and processed in main memory
without expensive hardware upgrades.
5 Limited Pass Algorithm
A limited pass algorithm is a technique used in data processing and analysis to efficiently process large
datasets with limited memory resources.
In a limited pass algorithm, the dataset is processed in a fixed number of passes or iterations, where each
pass involves processing a subset of the data. The algorithm ensures that each pass is designed to capture
the relevant information needed for the analysis, while minimizing the memory required to store the data.
For example, a limited pass algorithm for processing a large text file could involve reading the file in chunks
or sections, processing each section in memory, and then discarding the processed data before moving onto
the next section. This approach enables the algorithm to handle large datasets that cannot be loaded entirely
into memory.
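The text-file example can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `count_words_in_chunks`, the word-counting task, and the `chunk_lines` parameter are chosen for the example and do not come from the text.

```python
from collections import Counter

def count_words_in_chunks(path, chunk_lines=10_000):
    """Count words in a text file while holding at most `chunk_lines` lines in memory."""
    totals = Counter()
    chunk = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            chunk.append(line)
            if len(chunk) >= chunk_lines:
                # Process the current section, then discard it before reading on.
                totals.update(word for text in chunk for word in text.split())
                chunk.clear()
    # Process the final, possibly shorter, section.
    totals.update(word for text in chunk for word in text.split())
    return totals
```

Only the running totals and one chunk are held in memory at any time, so the file is read in a single sequential pass regardless of its size.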
Limited pass algorithms are often used in situations where the data cannot be stored in main memory,
or when the processing of the data requires significant computational resources. Examples of applications
that use limited pass algorithms include text processing, machine learning, and data mining.
While limited pass algorithms can handle datasets that do not fit in main memory, every additional pass
over the data adds reading cost, so they can be slower than algorithms that hold the entire dataset in memory. Therefore,
it is important to carefully design the algorithm to ensure that it can capture the relevant information needed
for the analysis, while minimizing the number of passes required to process the data.
6 Counting Frequent Itemsets in a Stream
Counting frequent itemsets in a stream is a problem of finding the most frequent itemsets in a continuous
stream of transactions. This problem is commonly known as the Frequent Itemset Mining problem. Here
are the steps involved in counting frequent itemsets in a stream:
1. Initialize a hash table to store the counts of each itemset. The size of the hash table should be limited
to prevent it from becoming too large.
2. Read each transaction in the stream one at a time.
3. Generate all the possible itemsets from the transaction. This can be done using the Apriori algorithm,
which generates candidate itemsets by combining smaller frequent itemsets.
4. Increment the count of each itemset in the hash table.
5. Prune infrequent itemsets from the hash table. An itemset is infrequent if its count is less than a
predefined threshold.
6. Repeat steps 2-5 for each transaction in the stream.
7. Output the frequent itemsets that remain in the hash table after processing all the transactions.
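The steps above can be sketched as follows. Several details are assumptions made for this example and are not specified in the text: itemsets are enumerated directly from each transaction up to `max_size` items (rather than through Apriori candidate generation), pruning runs only every `prune_every` transactions, and a Python `Counter` plays the role of the hash table. Because pruned itemsets lose their counts, the reported counts are lower bounds rather than exact values.

```python
from collections import Counter
from itertools import combinations

def count_stream_itemsets(stream, min_count, max_size=2, prune_every=1000):
    counts = Counter()                                   # step 1: table of itemset counts
    for n, transaction in enumerate(stream, start=1):    # steps 2 and 6: one transaction at a time
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):              # step 3: itemsets of up to max_size items
            for itemset in combinations(items, size):
                counts[itemset] += 1                     # step 4: increment the count
        if n % prune_every == 0:                         # step 5: periodic pruning
            for itemset in [s for s, c in counts.items() if c < min_count]:
                del counts[itemset]
    # Step 7: report the itemsets whose retained count reaches the threshold.
    return {s: c for s, c in counts.items() if c >= min_count}

# Hypothetical toy stream, for illustration only.
stream = [["a", "b"], ["a", "b"], ["a", "c"], ["b"]]
frequent = count_stream_itemsets(stream, min_count=2)
```

Capping the itemset size keeps the work per transaction bounded; without the cap, a transaction with n items would contribute 2^n - 1 itemsets.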
The main challenge in counting frequent itemsets in a stream is to keep track of the changing frequencies
of the itemsets as new transactions arrive. This can be done efficiently using the hash table to store the
counts of the itemsets. However, the hash table can become too large if the number of distinct itemsets is
too large. To prevent this, the hash table can be limited in size by using a hash function that maps each
itemset to a fixed number of hash buckets. The size of the hash table can be adjusted dynamically based on
the number of items and transactions in the stream.
Another challenge in counting frequent itemsets in a stream is to choose the threshold for the minimum
count of an itemset to be considered frequent. The threshold should be set high enough to exclude infrequent
itemsets, but low enough to include all the important frequent itemsets. The threshold can be determined
using heuristics or by using machine learning techniques to learn the optimal threshold from the data.
7 Clustering Techniques
Clustering techniques are used to group similar data points together in a dataset based on their similarity
or distance measures. Here are some popular clustering techniques:
7.1 K-Means Clustering:
This is a popular clustering algorithm that partitions a dataset into K clusters by assigning each data point
to the cluster whose center (the mean of the points in that cluster) is nearest. It involves an iterative
process of assigning data points to clusters and updating the cluster centers until convergence. K-Means is
commonly used in image segmentation, marketing, and customer segmentation.
7.1.1 K-means Clustering algorithm
K-Means clustering is a popular unsupervised machine learning algorithm that partitions a dataset into k
clusters, where k is a pre-defined number of clusters. The algorithm works as follows:
1. Initialize the k cluster centroids randomly.
2. Assign each data point to the nearest cluster centroid based on its distance.
3. Calculate the new cluster centroids based on the mean of all data points assigned to that cluster.
4. Repeat steps 2-3 until the cluster centroids no longer change significantly, or a maximum number of
iterations is reached.
Some further points about the algorithm:
• The distance metric used for step 2 is typically the Euclidean distance, but other distance metrics can
be used as well.
• The K-Means algorithm aims to minimize the sum of squared distances between each data point and
its assigned cluster centroid. This objective function is known as the within-cluster sum of squares
(WCSS) or the sum of squared errors (SSE).
• To determine the optimal number of clusters, a common approach is to use the elbow method. This
involves plotting the WCSS or SSE against the number of clusters and selecting the number of clusters
at the "elbow" point, where the rate of decrease in WCSS or SSE begins to level off.
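The K-Means steps can be sketched in plain Python as below. This is a minimal illustration, assuming points are equal-length numeric tuples; the function name `kmeans`, the fixed `seed`, and the toy `points` are assumptions for the example. Convergence is detected when no point changes cluster, which implies the centroids have stopped moving.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:         # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                          # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

Running the function with several values of `seed` and keeping the result with the lowest WCSS is a common way to reduce the effect of a poor random start.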
K-Means is a computationally efficient algorithm that can scale to large datasets. It is particularly useful
when the dataset is large and more expensive clustering algorithms would be too slow. However, K-Means
requires the number of clusters to be pre-defined and may converge to a suboptimal solution if the initial
cluster centroids are not well chosen. It also assumes roughly spherical clusters and may not work well when
clusters have elongated or irregular shapes. Here are some of its advantages and disadvantages:
Advantages:
• Simple and easy to understand: K-Means is easy to understand and implement, making it a popular
choice for clustering tasks.
• Fast and scalable: K-Means is a computationally efficient algorithm that can scale to large datasets.
It is particularly useful when the dataset is large and more expensive clustering algorithms would be
too slow.
• Works well with circular or spherical clusters: K-Means works well with circular or spherical clusters,
making it suitable for datasets that exhibit these types of shapes.
• Provides a clear and interpretable result: K-Means provides a clear and interpretable clustering
result, where each data point is assigned to one of the k clusters.
Disadvantages:
• Requires pre-defined number of clusters: K-Means requires the number of clusters to be pre-defined,
which can be a challenge when the number of clusters is unknown or difficult to determine.
• Sensitive to initial cluster centers: K-Means is sensitive to the initial placement of cluster centers
and can converge to a suboptimal solution if the initial centers are not well chosen.
• Can converge to a local minimum: K-Means can converge to a local minimum rather than the global
minimum, resulting in a suboptimal clustering solution.
• Not suitable for irregularly shaped clusters: K-Means assumes roughly spherical clusters around each
centroid and may not work well when clusters have elongated or non-convex shapes.
In summary, K-Means is a simple and fast clustering algorithm that works well with circular or spherical
clusters. However, it requires the number of clusters to be pre-defined and may converge to a suboptimal
solution if the initial cluster centers are not well chosen. It also may not work well when clusters have
elongated or non-convex shapes.
7.2 Hierarchical Clustering:
This technique builds a hierarchy of clusters by recursively dividing or merging clusters based on their
similarity. It can be agglomerative (bottom-up) or divisive (top-down). In agglomerative clustering, each
data point starts in its own cluster, and then pairs of clusters are successively merged until all data points
belong to a single cluster. Divisive clustering starts with all data points in a single cluster and recursively
divides them into smaller clusters. Hierarchical clustering is useful in gene expression analysis, social network
analysis, and image analysis.
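The agglomerative (bottom-up) variant can be sketched as below. The choice of single linkage (distance between the closest pair of points), the Euclidean distance, and stopping at `num_clusters` instead of merging down to a single cluster are assumptions made for this example.

```python
def agglomerative(points, num_clusters):
    """Single-linkage agglomerative clustering; returns a list of clusters (lists of points)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]           # every point starts in its own cluster
    while len(clusters) > num_clusters:
        # Find the pair of clusters whose closest members are nearest to each other.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))    # merge cluster j into cluster i
    return clusters

# Hypothetical one-dimensional toy data.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (25.0,)]
result = agglomerative(points, num_clusters=3)
```

Stopping at a chosen number of clusters corresponds to cutting the hierarchy at one level; recording the merges instead would give the full hierarchy.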
7.3 Density-based Clustering:
This technique identifies clusters based on the density of data points. It assumes that clusters are areas of
higher density separated by areas of lower density. Density-based clustering algorithms, such as DBSCAN
(Density-Based Spatial Clustering of Applications with Noise), group together data points that are closely
packed together and separate outliers. Density-based clustering is commonly used in image processing,
geospatial data analysis, and anomaly detection.
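A compact sketch of DBSCAN is shown below to make the idea of dense regions and outliers concrete. It assumes Euclidean distance and a brute-force neighbor search; the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors, counting the point itself) follow common usage rather than anything stated in the text.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)              # None means not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:               # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:                # noise reachable from a core point: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:           # j is itself a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```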
7.4 Gaussian Mixture Models:
This technique models the distribution of data points using a mixture of Gaussian probability distributions.
Each component of the mixture represents a cluster, and the algorithm estimates the parameters of the
mixture using the Expectation-Maximization algorithm. Gaussian Mixture Models are commonly used in
image segmentation, handwriting recognition, and speech recognition.
7.5 Spectral Clustering:
This technique converts the data points into a graph and then partitions the graph into clusters based
on the eigenvalues and eigenvectors of the graph Laplacian matrix. Spectral clustering is useful in image
segmentation, community detection in social networks, and document clustering.
Each clustering technique has its own strengths and weaknesses, and the choice of clustering algorithm
depends on the nature of the data, the clustering objective, and the computational resources available.
8 Clustering high-dimensional data
Clustering high-dimensional data is a challenging task because the distance or similarity measures used in
most clustering algorithms become less meaningful in high-dimensional space. Here are some techniques for
clustering high-dimensional data:
8.1 Dimensionality Reduction:
High-dimensional data can be transformed into a lower-dimensional space using dimensionality reduction
techniques, such as Principal Component Analysis (PCA) or t-SNE (t-distributed Stochastic Neighbor
Embedding). Dimensionality reduction can help to mitigate the curse of dimensionality and make the clustering
algorithms more effective.
8.2 Feature Selection:
Not all features in high-dimensional data are equally informative. Feature selection techniques can be used
to identify the most relevant features for clustering and discard the redundant or noisy features. This can
help to improve the clustering accuracy and reduce the computational cost.
8.3 Subspace Clustering:
Subspace clustering is a clustering technique that identifies clusters in subspaces of the high-dimensional
space. This technique assumes that the data points lie in a union of subspaces, each of which represents
a cluster. Subspace clustering algorithms, such as CLIQUE (CLustering In QUEst), identify the subspaces
and clusters simultaneously.
8.4 Density-Based Clustering:
Density-based clustering algorithms, such as DBSCAN, can be used for clustering high-dimensional data by
defining density as the number of data points within a given radius of each point. The clustering algorithm
identifies regions of high density in the multidimensional space, which correspond to clusters.
8.5 Ensemble Clustering:
Ensemble clustering combines multiple clustering algorithms or different parameter settings of the same
algorithm to improve the clustering performance. Ensemble clustering can help to reduce the sensitivity of
the clustering results to the choice of algorithm or parameter settings.
8.6 Deep Learning-Based Clustering:
Deep learning-based clustering techniques, such as Deep Embedded Clustering (DEC) and Autoencoder-
based Clustering (AE-Clustering), use neural networks to learn a low-dimensional representation of high-
dimensional data and cluster the data in the reduced space. These techniques have shown promising results in
clustering high-dimensional data in various domains, including image analysis and gene expression analysis.
Clustering high-dimensional data requires careful consideration of the choice of clustering algorithm,
feature selection or dimensionality reduction technique, and parameter settings. A combination of different
techniques may be required to achieve the best clustering performance.
8.7 CLIQUE and ProCLUS
CLIQUE (CLustering In QUEst) and ProCLUS are two popular subspace clustering algorithms for high-
dimensional data.
CLIQUE is a grid-based and density-based algorithm that works by identifying dense subspaces in the data. It
partitions each dimension into equal-width intervals and treats a grid cell (unit) as dense if the fraction of
data points it contains exceeds a user-defined density threshold. Dense units are found bottom-up, in the
same way Apriori finds frequent itemsets: a unit can be dense in k dimensions only if its projections onto
k-1 dimensions are dense, so candidate k-dimensional units are generated only from dense
(k-1)-dimensional units. Connected dense units within the same subspace are then merged to form clusters.
CLIQUE is efficient for high-dimensional data because this pruning avoids examining most combinations of
dimensions.
ProCLUS (PROjected CLUStering) is a projected clustering algorithm based on k-medoids. It takes as input
the number of clusters and the average number of dimensions per cluster. It first selects a set of candidate
medoids from a sample of the data. It then iteratively picks a set of medoids, determines for each medoid the
subset of dimensions along which its nearby points are most tightly grouped, assigns each point to the
closest medoid using a distance computed only over that medoid's dimensions, and replaces poorly
performing medoids. A final refinement step recomputes the dimensions of each cluster, reassigns the points,
and marks outliers. ProCLUS is effective for high-dimensional data because each cluster is described only
by the dimensions that are relevant to it.
Both CLIQUE and ProCLUS are designed to handle high-dimensional data by identifying clusters in
subspaces of the data. They are effective for clustering data that have a natural subspace structure. However,
they may not work well for data that do not have a clear subspace structure or when the data points are
widely spread out in the high-dimensional space. It is important to carefully choose the appropriate algorithm
based on the characteristics of the data and the clustering objectives.
9 Frequent pattern-based clustering methods
Frequent pattern-based clustering methods combine frequent pattern mining with clustering techniques to
identify clusters based on frequent patterns in the data. Here are some examples of frequent pattern-based
clustering methods:
1. Frequent Pattern-based Clustering: a clustering algorithm that uses frequent pattern mining to
identify clusters in transactional data. The algorithm first identifies frequent itemsets in the data
using the Apriori or FP-Growth algorithm. It then constructs a graph where each frequent itemset is a
node, and the edges represent the overlap between the itemsets. The graph is partitioned into clusters
using a graph clustering algorithm. The resulting clusters are then used to assign objects to clusters
based on their membership in the frequent itemsets.
2. Frequent Pattern-based Clustering Method: a clustering algorithm that uses frequent pattern mining
to identify clusters in high-dimensional data. The algorithm first discretizes the continuous data into
categorical data. It then uses the Apriori or FP-Growth algorithm to identify frequent itemsets in the
categorical data. The frequent itemsets are used to construct a binary matrix that represents the
membership of objects in the frequent itemsets. The binary matrix is clustered using a standard
clustering algorithm, such as K-Means or Hierarchical clustering. The resulting clusters are then used
to assign objects to clusters based on their membership in the frequent itemsets.
3. Clustering based on Frequent Pattern Combination: a clustering algorithm that combines frequent
pattern mining with pattern combination techniques to identify clusters in transactional data. The
algorithm first identifies frequent itemsets in the data using the Apriori or FP-Growth algorithm. It
then generates composite patterns from the frequent itemsets, using criteria such as Minimum
Description Length (MDL) or the Bayesian Information Criterion (BIC) to decide which combinations to
keep. The composite patterns are then used to construct a graph, which is partitioned into clusters
using a graph clustering algorithm.
Frequent pattern-based clustering methods are effective for identifying clusters based on frequent patterns
in the data. They can be applied to a wide range of data types, including transactional data and high-
dimensional data. However, these methods may suffer from the curse of dimensionality when applied to
high-dimensional data. It is important to carefully select the appropriate frequent pattern mining and
clustering techniques based on the characteristics of the data and the clustering objectives.
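The binary membership matrix used by the second method can be built as in the sketch below. The function name `membership_matrix` and the toy transactions and itemsets are assumptions for illustration; in practice the frequent itemsets would come from Apriori or FP-Growth.

```python
def membership_matrix(transactions, frequent_itemsets):
    """One row per transaction, one column per frequent itemset; 1 if the row contains it."""
    return [[1 if set(itemset) <= set(t) else 0 for itemset in frequent_itemsets]
            for t in transactions]

# Hypothetical transactions and frequent itemsets, for illustration only.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"d"}]
frequent_itemsets = [("a", "b"), ("c",), ("d",)]
matrix = membership_matrix(transactions, frequent_itemsets)
```

Each row of `matrix` can then be passed as a feature vector to a standard clustering algorithm such as K-Means.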
10 Clustering in non-Euclidean space
Clustering in non-Euclidean space refers to the clustering of data that are not naturally represented as
points in Euclidean space, such as graphs, time series, or text data. Algorithms such as K-Means assume
that the data points lie in Euclidean space, because they summarize each cluster by the mean (centroid) of
its points and measure similarity with the Euclidean distance. In non-Euclidean spaces only a pairwise
distance or similarity between objects may be available, and a meaningful average of several objects may
not exist, so centroid-based clustering methods may not be suitable.
Here are some approaches for clustering in non-Euclidean spaces:
1. Spectral clustering: Spectral clustering is a popular clustering algorithm that can be applied to data
represented in non-Euclidean spaces, such as graphs or time series. It uses the eigenvalues and eigen-
vectors of the Laplacian matrix of the data to identify clusters. Spectral clustering converts the data
points into a graph representation and then computes the Laplacian matrix of the graph. The eigen-
vectors of the Laplacian matrix are used to embed the data points into a lower-dimensional space,
where clustering is performed using a standard clustering algorithm, such as K-Means or Hierarchical
clustering.
2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): a density-based clustering
algorithm that can be applied to data represented in non-Euclidean spaces. It does not need coordinates
or cluster centers, only a distance function between pairs of points, and it clusters data points based
on their density. DBSCAN uses two parameters: a radius that determines the neighborhood of a point, and
the minimum number of points that this neighborhood must contain for the point to count as a core point.
DBSCAN labels each point as either a core point, a border point, or a noise point, based on its
neighborhood. The core points are used to form clusters.
3. Topic modeling: Topic modeling is a clustering method that can be applied to text data, which is
typically represented in a non-Euclidean space. Topic modeling identifies latent topics in the text data
by analyzing the co-occurrence of words. It represents each document as a distribution over topics,
and each topic as a distribution over words. The resulting topic distribution of each document can be
used to cluster the documents based on their similarity.
Clustering in non-Euclidean spaces requires careful consideration of the appropriate algorithms and tech-
niques that are suitable for the specific data type. Spectral clustering and DBSCAN are effective for clustering
data represented as graphs or time series, while topic modeling is suitable for text data. Other approaches,
such as manifold learning and kernel methods, can also be used for clustering in non-Euclidean spaces.
11 Clustering for streams and parallelism
Stream processing and parallelism are two important considerations when clustering large datasets. Stream
data refers to data that arrives continuously and in real-time, while parallelism refers to the ability to
distribute the clustering task across multiple computing resources.
Here are some approaches to clustering streams and to parallel clustering:
1. Online clustering: Online clustering is a technique that can be applied to streaming data. It updates
the clustering model continuously as new data arrives. Online clustering algorithms, such as BIRCH
and CluStream, are designed to handle data streams and can scale to large datasets. These algorithms
incrementally update the cluster model as new data arrives and discard outdated data points
to maintain the cluster model's accuracy and efficiency.
2. Parallel clustering: Parallel clustering refers to the use of multiple computing resources, such as multiple
processors or computing clusters, to speed up the clustering process. Parallel versions of K-Means,
hierarchical clustering, and DBSCAN distribute the clustering task
across multiple computing resources. These algorithms partition the data into smaller subsets and
assign each subset to a separate computing resource. The resulting clusters are then merged to produce
the final clustering result.
3. Distributed clustering: Distributed clustering refers to the use of multiple computing resources that
are distributed across different physical locations, such as different data centers or cloud resources.
Distributed clustering implementations, built on frameworks such as Hadoop MapReduce, distribute the
clustering task across multiple computing resources and handle data that is too large to fit into a
single computing resource's memory. They partition the data into smaller subsets and assign each subset to
a separate computing resource. The resulting clusters are then merged to produce the final clustering
result.
Clustering for streams and parallelism requires careful consideration of the appropriate algorithms and
techniques that are suitable for the specific clustering objectives and data types. Online clustering is effective
for clustering streaming data, while parallel clustering and distributed clustering can speed up the clustering
process for large datasets.
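To illustrate the idea of updating a cluster model as data arrives, the sketch below uses sequential (online) K-Means. This is a deliberately simpler technique than BIRCH or CluStream and is shown only as an illustration; the function name `online_kmeans` and the assumption that initial centroids are supplied by the caller are choices made for this example.

```python
def online_kmeans(stream, initial_centroids):
    """Sequential K-Means: update the nearest centroid as each point arrives."""
    centroids = [list(c) for c in initial_centroids]
    counts = [1] * len(centroids)    # each initial centroid counts as one observed point
    for x in stream:
        # Find the centroid nearest to the new point (squared Euclidean distance).
        j = min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))
        counts[j] += 1
        # Incremental mean: move the centroid a 1/n step toward the new point.
        centroids[j] = [c + (a - c) / counts[j] for a, c in zip(x, centroids[j])]
    return centroids

# Hypothetical one-dimensional stream and starting centroids.
stream = [(0.0,), (2.0,), (10.0,), (12.0,)]
final = online_kmeans(stream, initial_centroids=[(1.0,), (11.0,)])
```

Each point is seen once and then discarded, so memory use depends only on the number of centroids, not on the length of the stream.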
Q1: Write an R function to check whether the given number is prime or not.
# Function to check whether a given number is prime
is_prime <- function(num) {
  # prime numbers are greater than 1
  if (num <= 1) return(FALSE)
  # 2 and 3 are prime; returning here also avoids the descending sequence 2:1
  if (num <= 3) return(TRUE)
  # check for factors up to the square root of num
  for (i in 2:floor(sqrt(num))) {
    if ((num %% i) == 0) return(FALSE)
  }
  TRUE
}
# take input from the user
num <- as.integer(readline(prompt = "Enter a number: "))
if (is_prime(num)) {
  print(paste(num, "is a prime number"))
} else {
  print(paste(num, "is not a prime number"))
}
Apriori algorithm: The Apriori algorithm solves the frequent itemsets problem. The algorithm analyzes
a data set to determine which combinations of items occur together frequently. The Apriori algorithm
is at the core of various algorithms for data mining problems. The best-known problem is finding the
association rules that hold in a basket-item relation.
Numerical:
Given:
Support = 60% = 60/100 × 5 = 3
Confidence = 70%
ITERATION 1:
STEP 1: (C1)
Itemsets Counts
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3
STEP 2: (L1)
Itemsets Counts
E 4
K 5
M 3
O 3
Y 3
ITERATION 2:
STEP 3: (C2)
Itemsets Counts
E, K 4
E, M 2
E, O 3
E, Y 2
K, M 3
K, O 3
K, Y 3
M, O 1
M, Y 2
O, Y 2
STEP 4: (L2)
Itemsets Counts
E, K 4
E, O 3
K, M 3
K, O 3
K, Y 3
ITERATION 3:
STEP 5: (C3)
Itemsets Counts
E, K, O 3
K, M, O 1
K, M, Y 2
K, O, Y 2
STEP 6: (L3)
Itemsets Counts
E, K, O 3
Now, stop since no more combinations can be made in L3.
ASSOCIATION RULE:
1. [E, K] → O = 3/4 = 75%
2. [K, O] → E = 3/3 = 100%
3. [E, O] → K = 3/3 = 100%
4. E → [K, O] = 3/4 = 75%
5. K → [E, O] = 3/5 = 60%
6. O → [E, K] = 3/3 = 100%
Therefore, Rule no. 5 is discarded because its confidence (60%) is below the minimum confidence of 70%.
So, Rules 1, 2, 3, 4, and 6 are selected.
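The counts and confidences in this worked example can be checked with a short script. The transactions below are taken from the exam question reproduced later in this document (T100 to T500); the helper names `count` and `confidence` are assumptions for the sketch, and the 70% threshold is the one given in this solution.

```python
transactions = [
    {"M", "O", "N", "K", "E", "Y"},   # T100
    {"D", "O", "N", "K", "E", "Y"},   # T200
    {"M", "A", "K", "E"},             # T300
    {"M", "U", "C", "K", "Y"},        # T400
    {"C", "O", "K", "I", "E"},        # T500 (the repeated O collapses in a set)
]

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(lhs, rhs):
    """count(lhs and rhs together) / count(lhs)."""
    return count(set(lhs) | set(rhs)) / count(lhs)

# Each string is read as a set of single-letter items, e.g. "EK" is {E, K}.
rules = [("EK", "O"), ("KO", "E"), ("EO", "K"), ("E", "KO"), ("K", "EO"), ("O", "EK")]
min_conf = 0.70
kept = [(lhs, rhs) for lhs, rhs in rules if confidence(lhs, rhs) >= min_conf]
```

Here `kept` contains every rule except K → [E, O], whose confidence is 3/5.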
Printed Page: 1 of 2
Subject Code: KIT601
Roll No:
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Time: 3 Hours Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. 2*10 = 20
Qno Questions CO
(a) Discuss the need of data analytics. 1
(b) Give the classification of data. 1
(c) Define neural network. 2
(d) What is multivariate analysis? 2
(e) Give the full form of RTAP and discuss its application. 3
(f) What is the role of sampling data in a stream? 3
(g) Discuss the use of limited pass algorithm. 4
(h) What is the principle behind hierarchical clustering technique? 4
(i) List five R functions used in descriptive statistics. 5
(j) List the names of any 2 visualization tools. 5
SECTION B
2. Attempt any three of the following: 10*3 = 30
Qno Questions CO
(a) Explain the process model and computation model for Big data platform. 1
(b) Explain the use and advantages of decision trees. 2
(c) Explain the architecture of data stream model. 3
(d) Illustrate the K-means algorithm in detail with its advantages. 4
(e) Differentiate between NoSQL and RDBMS databases. 5
SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the various phases of data analytics life cycle. 1
(b) Explain modern data analytics tools in detail. 1
4. Attempt any one part of the following: 10 *1 = 10
Qno Questions CO
(a) Compare various types of support vector and kernel methods of data analysis. 2
(b) Given data= {2,3,4,5,6,7;1,5,3,6,7,8}. Compute the principal component using PCA algorithm. 2
5. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain any one algorithm to count number of distinct elements in a data stream. 3
(b) Discuss the case study of stock market predictions in detail. 3
6. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Differentiate between CLIQUE and ProCLUS clustering. 4
(b) A database has 5 transactions. Let min_sup=60% and min_conf=80%.
TID Items_Bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
i) Find all frequent itemsets using Apriori algorithm.
ii) List all the strong association rules (with support s and confidence c). 4
7. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the HIVE architecture with its features in detail. 5
(b) Write R function to check whether the given number is prime or not. 5
DATABASE SYSTEMS GROUP
What is Frequent Itemset Mining?
Frequent Itemset Mining:
Finding frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases, relational
databases, and other information repositories.
• Given:
– A set of items $I = \{i_1, i_2, \ldots, i_m\}$
– A database of transactions $D$, where a transaction $T \subseteq I$ is a set of items
• Task 1: find all subsets of items that occur together in many
transactions.
– E.g.: 85% of transactions contain the itemset {milk, bread, butter}
• Task 2: find all rules that correlate the presence of one set of items with
that of another set of items in the transaction database.
– E.g.: 98% of people buying tires and auto accessories also get automotive service done
• Applications: Basket data analysis, cross-marketing, catalog design,
loss-leader analysis, clustering, classification, recommendation systems,
etc.
Example: Basket Data Analysis
• Transaction database
D = {{butter, bread, milk, sugar};
{butter, flour, milk, sugar};
{butter, eggs, milk, salt};
{eggs};
{butter, flour, milk, salt, sugar}}
• Question of interest:
– Which items are bought together frequently?
• Applications
– Improved store layout
– Cross marketing
– Focused attached mailings / add-on sales
– * ⇒ Maintenance Agreement
(What the store should do to boost Maintenance Agreement sales)
– Home Electronics ⇒ * (What other products should the store stock up?)
items frequency
{butter} 4
{milk} 4
{butter, milk} 4
{sugar} 3
{butter, sugar} 3
{milk, sugar} 3
{butter, milk, sugar} 3
{eggs} 2
…
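The frequency table of the basket example can be reproduced with a few lines of Python. This is a minimal sketch that enumerates every subset of every transaction, which is only feasible for tiny databases; the function name `itemset_frequencies` is chosen here for illustration.

```python
from collections import Counter
from itertools import combinations

# Transaction database D from the basket example.
D = [
    {"butter", "bread", "milk", "sugar"},
    {"butter", "flour", "milk", "sugar"},
    {"butter", "eggs", "milk", "salt"},
    {"eggs"},
    {"butter", "flour", "milk", "salt", "sugar"},
]

def itemset_frequencies(transactions):
    """For every itemset that occurs in some transaction, count the transactions containing it."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counts[frozenset(subset)] += 1
    return counts

freq = itemset_frequencies(D)
# Print the itemsets that occur in at least 3 transactions, most frequent first.
for itemset, n in sorted(freq.items(), key=lambda kv: (-kv[1], len(kv[0]), sorted(kv[0]))):
    if n >= 3:
        print(sorted(itemset), n)
```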
Chapter 3: Frequent Itemset Mining
1) Introduction
– Transaction databases, market basket data analysis
2) Mining Frequent Itemsets
– Apriori algorithm, hash trees, FP-tree
3) Simple Association Rules
– Basic notions, rule generation, interestingness measures
4) Further Topics
– Hierarchical Association Rules
• Motivation, notions, algorithms, interestingness
– Quantitative Association Rules
• Motivation, basic idea, partitioning numerical attributes, adaptation of apriori algorithm, interestingness
5) Extensions and Summary
Mining Frequent Itemsets: Basic Notions
• Items $I = \{i_1, i_2, \dots, i_m\}$: a set of literals (denoting items)
• Itemset $X$: set of items $X \subseteq I$
• Database $D$: set of transactions $T$, each transaction is a set of items $T \subseteq I$
• Transaction $T$ contains an itemset $X$: $X \subseteq T$
• The items in transactions and itemsets are sorted lexicographically:
– itemset $X = (x_1, x_2, \dots, x_k)$, where $x_1 \le x_2 \le \dots \le x_k$
• Length of an itemset: number of elements in the itemset
• k-itemset: itemset of length k
• The support of an itemset X is defined as: $support(X) = |\{T \in D \mid X \subseteq T\}|$
• Frequent itemset: an itemset X is called frequent for database $D$ iff it is contained in at least $minSup$ many transactions: $support(X) \ge minSup$
• Goal 1: Given a database $D$ and a threshold $minSup$, find all frequent itemsets $X \in Pot(I)$, where $Pot(I)$ denotes the power set of $I$.
Mining Frequent Itemsets: Basic Idea
• Naïve Algorithm
– count the frequency of all possible subsets of $I$ in the database
⇒ too expensive since there are $2^m$ such itemsets for $|I| = m$ items (cardinality of the power set)
• The Apriori principle (anti-monotonicity):
Any non-empty subset of a frequent itemset is frequent, too!
$A \subseteq I$ with $support(A) \ge minSup \Rightarrow \forall A' \subset A \wedge A' \ne \emptyset: support(A') \ge minSup$
Any superset of a non-frequent itemset is non-frequent, too!
$A \subseteq I$ with $support(A) < minSup \Rightarrow \forall A' \supset A: support(A') < minSup$
• Method based on the apriori principle
– First count the 1-itemsets, then the 2-itemsets, then the 3-itemsets, and so on
– When counting (k+1)-itemsets, only consider those (k+1)-itemsets where all subsets of length k have been determined as frequent in the previous step
[Figure: itemset lattice over {A, B, C, D} with the levels ∅; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD; a region of the lattice is marked "not frequent"]
The Apriori Algorithm
variable Ck: candidate itemsets of size k
variable Lk: frequent itemsets of size k
L1 = {frequent items}
for (k = 1; Lk != ∅; k++) do begin
    // produce candidates:
    // JOIN STEP: join Lk with itself to produce Ck+1
    // PRUNE STEP: discard (k+1)-itemsets from Ck+1 that
    //             contain non-frequent k-itemsets as subsets
    Ck+1 = candidates generated from Lk
    // prove candidates:
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk
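The pseudocode can be turned into a short runnable sketch. The version below is one possible reading, not a reference implementation: it takes an absolute count threshold (`min_count`, a name chosen here), and it joins any two frequent k-itemsets whose union has k+1 items instead of using the prefix-based join; after pruning, both variants yield the same candidates. The four-transaction database is a small illustrative input.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for all itemsets contained in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # first scan: count the 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}   # L1
    frequent = dict(level)
    k = 1
    while level:
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) != k + 1:             # join: the two k-itemsets must differ in one item
                continue
            # prune: every k-subset of the candidate must be frequent
            if all(frozenset(s) in level for s in combinations(union, k)):
                candidates.add(union)
        counts = {c: 0 for c in candidates}
        for t in transactions:                  # one database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}   # L(k+1)
        frequent.update(level)
        k += 1
    return frequent

D = [{1, 3, 4, 6}, {2, 3, 5}, {1, 2, 3, 5}, {1, 5, 6}]
result = apriori(D, min_count=2)                # minSup = 0.5 of 4 transactions
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```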
Generating Candidates (Join Step)
• Requirements for the set of all candidate (k+1)-itemsets $C_{k+1}$
– Completeness:
Must contain all frequent (k+1)-itemsets (superset property $C_{k+1} \supseteq L_{k+1}$)
– Selectiveness:
Significantly smaller than the set of all (k+1)-subsets
– Suppose the items are sorted by any order (e.g., lexicographically)
• Step 1: Joining ($C_{k+1} = L_k \bowtie L_k$)
– Consider frequent k-itemsets p and q
– p and q are joined if they share the same first k−1 items
insert into Ck+1
select p.i1, p.i2, …, p.ik−1, p.ik, q.ik
from Lk : p, Lk : q
where p.i1 = q.i1, …, p.ik−1 = q.ik−1, p.ik < q.ik
Example (k = 3): p = (A, C, F) ∈ L3 and q = (A, C, G) ∈ L3 are joined to (A, C, F, G) ∈ C4
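The SQL-like join above can be sketched in Python, assuming each itemset is stored as a sorted tuple. The function name `join_step` is chosen here for illustration.

```python
def join_step(L_k):
    """Join frequent k-itemsets (sorted tuples) that share their first k-1 items."""
    L_k = sorted(L_k)
    candidates = []
    for i, p in enumerate(L_k):
        for q in L_k[i + 1:]:
            # Same first k-1 items; p[-1] < q[-1] holds because L_k is sorted.
            if p[:-1] == q[:-1]:
                candidates.append(p + (q[-1],))
    return candidates

print(join_step([("A", "C", "F"), ("A", "C", "G")]))   # [('A', 'C', 'F', 'G')]
```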
Generating Candidates (Prune Step)
• Step 2: Pruning ($L_{k+1} = \{X \in C_{k+1} \mid support(X) \ge minSup\}$)
– Naïve: Check support of every itemset in $C_{k+1}$ ⇒ inefficient for huge $C_{k+1}$
– Instead, apply Apriori principle first: Remove candidate (k+1)-itemsets which contain a non-frequent k-subset s, i.e., $s \notin L_k$
forall itemsets c in Ck+1 do
    forall k-subsets s of c do
        if (s is not in Lk) then delete c from Ck+1
• Example 1
– L3 = {(ACF), (ACG), (AFG), (AFH), (CFG)}
– Candidates after the join step: {(ACFG), (AFGH)}
– In the pruning step: delete (AFGH) because (FGH) ∉ L3, i.e., (FGH) is not a frequent 3-itemset; also (AGH) ∉ L3
⇒ C4 = {(ACFG)} ⇒ check the support to generate L4
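A minimal sketch of the prune step, again assuming sorted tuples as itemsets; `prune_step` is a name chosen here. It reproduces Example 1.

```python
from itertools import combinations

def prune_step(candidates, L_k):
    """Keep only the (k+1)-candidates whose k-subsets are all frequent."""
    frequent = set(L_k)
    return [
        c for c in candidates
        if all(s in frequent for s in combinations(c, len(c) - 1))
    ]

L3 = [("A", "C", "F"), ("A", "C", "G"), ("A", "F", "G"), ("A", "F", "H"), ("C", "F", "G")]
after_join = [("A", "C", "F", "G"), ("A", "F", "G", "H")]
C4 = prune_step(after_join, L3)
print(C4)   # [('A', 'C', 'F', 'G')]
```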
Apriori Algorithm – Full Example
minSup = 0.5

Database D:
TID   items
100   1 3 4 6
200   2 3 5
300   1 2 3 5
400   1 5 6

Scan D → C1 (itemset: count):
{1}: 3, {2}: 2, {3}: 3, {4}: 1, {5}: 3, {6}: 2

L1:
{1}: 3, {2}: 2, {3}: 3, {5}: 3, {6}: 2

L1 ⋈ L1 → C2:
{1 2}, {1 3}, {1 5}, {1 6}, {2 3}, {2 5}, {2 6}, {3 5}, {3 6}, {5 6}

Prune C2 (no candidate is removed):
{1 2}, {1 3}, {1 5}, {1 6}, {2 3}, {2 5}, {2 6}, {3 5}, {3 6}, {5 6}

Scan D → C2 (itemset: count):
{1 2}: 1, {1 3}: 2, {1 5}: 2, {1 6}: 2, {2 3}: 2, {2 5}: 2, {2 6}: 0, {3 5}: 2, {3 6}: 1, {5 6}: 1

L2:
{1 3}: 2, {1 5}: 2, {1 6}: 2, {2 3}: 2, {2 5}: 2, {3 5}: 2

L2 ⋈ L2 → C3:
{1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}

Prune C3:
{1 3 5}, {1 3 6} ✗, {1 5 6} ✗, {2 3 5}

Scan D → C3 (itemset: count):
{1 3 5}: 1, {2 3 5}: 2

L3:
{2 3 5}: 2

L3 ⋈ L3 → C4 is empty
How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
– The total number of candidates can be very large
– One transaction may contain many candidates
• Method: Hash-Tree
– Candidate itemsets are stored in a hash-tree
– Leaf nodes of the hash-tree contain lists of itemsets and their support (i.e., counts)
– Interior nodes contain hash tables
– Subset function finds all the candidates contained in a transaction
[Figure: example hash-tree for 3-itemsets with hash function h(K) = K mod 3. Interior nodes branch on the hash values 0, 1, 2; the leaves hold the candidates (3 6 7), (3 5 7), (3 5 11), (7 9 12), (1 6 11), (1 4 11), (1 7 9), (7 8 9), (1 11 12), (2 3 8), (5 6 7), (2 5 6), (2 5 7), (5 8 11), (3 4 15), (3 7 11), (3 4 11), (3 4 8), (2 4 6), (2 7 9), (2 4 7), (5 7 10)]
Hash-Tree – Construction
• Searching for an itemset
– Start at the root (level 1)
– At level d: apply the hash function h to the d-th item in the itemset
• Insertion of an itemset
– Search for the corresponding leaf node, and insert the itemset into that leaf
– If an overflow occurs:
• Transform the leaf node into an internal node
• Distribute the entries to the new leaf nodes according to the hash function
[Figure: hash-tree for 3-itemsets with h(K) = K mod 3]
Hash-Tree – Counting
• Search all candidate itemsets contained in a transaction T = (t1 t2 ... tn) for a current itemset length of k
• At the root
– Determine the hash values for each item t1 t2 ... tn-k+1 in T
– Continue the search in the resulting child nodes
• At an internal node at level d (reached after hashing of item $t_i$)
– Determine the hash values and continue the search for each item $t_j$ with $i < j \le n - k + d$
• At a leaf node
– Check whether the itemsets in the leaf node are contained in transaction T
[Figure: hash-tree for 3-itemsets with h(K) = K mod 3, searched with the transaction (1, 3, 7, 9, 12), i.e., n = 5 and k = 3; pruned subtrees and tested leaf nodes are highlighted]
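The construction and counting rules can be sketched as a small Python class. This is an illustrative version under assumptions of its own: the leaf capacity of 3 (`max_leaf_size`) is chosen here, so the resulting tree shape need not match the figure, and a `seen` set ensures that a leaf reached along several hash paths is tested only once per transaction.

```python
class Node:
    def __init__(self):
        self.children = None   # None marks a leaf; a dict {hash value: Node} marks an interior node
        self.itemsets = {}     # leaf payload: candidate (sorted tuple) -> support count

class HashTree:
    def __init__(self, k, max_leaf_size=3, buckets=3):
        self.k, self.max_leaf_size, self.buckets = k, max_leaf_size, buckets
        self.root = Node()

    def _h(self, item):
        return item % self.buckets                 # h(K) = K mod 3 by default

    def insert(self, itemset, node=None, depth=0):
        node = self.root if node is None else node
        if node.children is not None:              # interior node: hash the item at this depth
            child = node.children.setdefault(self._h(itemset[depth]), Node())
            self.insert(itemset, child, depth + 1)
            return
        node.itemsets[itemset] = 0
        if len(node.itemsets) > self.max_leaf_size and depth < self.k:
            entries = list(node.itemsets)          # overflow: the leaf becomes an interior node
            node.itemsets, node.children = {}, {}
            for entry in entries:
                self.insert(entry, node, depth)

    def count(self, transaction):
        self._count(self.root, tuple(sorted(transaction)), 0, 0, set())

    def _count(self, node, t, start, depth, seen):
        if node.children is None:
            if id(node) not in seen:               # test each leaf once per transaction
                seen.add(id(node))
                items = set(t)
                for c in node.itemsets:
                    if items.issuperset(c):
                        node.itemsets[c] += 1
            return
        # Only items that leave enough later items to complete a k-itemset are hashed.
        for j in range(start, len(t) - self.k + depth + 1):
            child = node.children.get(self._h(t[j]))
            if child is not None:
                self._count(child, t, j + 1, depth + 1, seen)

    def supports(self, node=None):
        node = self.root if node is None else node
        if node.children is None:
            return dict(node.itemsets)
        result = {}
        for child in node.children.values():
            result.update(self.supports(child))
        return result

candidates = [(3, 6, 7), (3, 5, 7), (3, 5, 11), (7, 9, 12), (1, 6, 11), (1, 4, 11),
              (1, 7, 9), (7, 8, 9), (1, 11, 12), (2, 3, 8), (5, 6, 7), (2, 5, 6),
              (2, 5, 7), (5, 8, 11), (3, 4, 15), (3, 7, 11), (3, 4, 11), (3, 4, 8),
              (2, 4, 6), (2, 7, 9), (2, 4, 7), (5, 7, 10)]
tree = HashTree(k=3)
for c in candidates:
    tree.insert(c)
tree.count((1, 3, 7, 9, 12))
print({c: n for c, n in tree.supports().items() if n > 0})
```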
Is Apriori Fast Enough? – Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent (k−1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
• $10^4$ frequent 1-itemsets will generate on the order of $10^7$ candidate 2-itemsets
• To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate $2^{100} \approx 10^{30}$ candidates.
– Multiple scans of database:
• Needs n or n+1 scans, where n is the length of the longest pattern
⇒ Is it possible to mine the complete set of frequent itemsets without candidate generation?
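The two candidate counts can be checked directly. This is only a quick arithmetic sketch: the number of candidate 2-itemsets is the number of pairs of frequent items, and the number of candidates for a pattern of size 100 is the number of its non-empty subsets.

```python
import math

pairs = math.comb(10**4, 2)       # candidate 2-itemsets from 10^4 frequent items
subsets = 2**100 - 1              # non-empty subsets of a 100-itemset
print(pairs)                      # 49995000, i.e., on the order of 10^7
print(f"{subsets:.2e}")           # about 1.27e+30
```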
Mining Frequent Patterns Without Candidate Generation
• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only!
• Idea:
– Compress database into FP-tree, retaining the itemset association information
– Divide the compressed database into conditional databases, each associated with one frequent item, and mine each such database separately.
Construct FP-tree from a Transaction DB
Steps for compressing the database into an FP-tree:
1. Scan DB once, find frequent 1-itemsets (single items)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree starting with most frequent item per transaction

Steps 1 & 2: header table (minSup = 0.5), items sorted in the order of descending support
item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

Step 3a: for each transaction only keep its frequent items, sorted in descending order of their frequencies
TID   items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o}             {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Step 3b: for each transaction build a path in the FP-tree:
- If a path with common prefix exists: increment frequency of nodes on this path and append suffix
- Otherwise: create a new branch

Resulting FP-tree (the header table references the occurrences of the frequent items in the FP-tree):
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
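The construction steps can be sketched in Python. The class `FPNode` and the function `build_fp_tree` are names chosen here; the item order "fcabmp" is passed in explicitly, because items with equal frequency can be ordered in more than one way and this sketch simply follows the header table.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                       # item -> child FPNode

def build_fp_tree(transactions, item_order):
    """item_order lists the frequent items in descending frequency order."""
    rank = {item: i for i, item in enumerate(item_order)}
    root = FPNode(None)
    header = {item: [] for item in item_order}   # item -> its nodes in the tree (node-links)
    for t in transactions:
        # Step 3a: keep only the frequent items, sorted by the global order.
        ordered = sorted((i for i in t if i in rank), key=rank.__getitem__)
        node = root
        for item in ordered:                     # Step 3b: follow or extend the path
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def show(node, indent=0):
    for child in node.children.values():
        print("  " * indent + f"{child.item}:{child.count}")
        show(child, indent + 1)

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]   # one character per item
root, header = build_fp_tree(db, "fcabmp")
show(root)
```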
Benefits of the FP-tree Structure
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduces irrelevant information: infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be shared
– never larger than the original database (not counting node-links and counts)
– Experiments demonstrate compression ratios over 100
Mining Frequent Patterns Using FP-tree
• General idea (divide-and-conquer)
– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base (prefix paths), and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree …
– … until the resulting FP-tree is empty, or it contains only one path (single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree
2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
– If the conditional FP-tree contains a single path, simply enumerate all the patterns
Major Steps to Mine FP-tree: Conditional Pattern Base
1) Construct conditional pattern base for each node in the FP-tree
– Starting at the frequent header table in the FP-tree
– Traverse FP-tree by following the link of each frequent item (dashed lines)
– Accumulate all of transformed prefix paths of that item to form a conditional pattern base
• For each item its prefixes are regarded as condition for it being a suffix. These prefixes form the conditional pattern base. The frequency of the prefixes can be read in the node of the item.
[Figure: FP-tree with the paths f:4 – c:3 – a:3 – m:2 – p:2, a:3 – b:1 – m:1, f:4 – b:1 and c:1 – b:1 – p:1, together with its header table (f 4, c 4, a 3, b 3, m 3, p 3); dashed node-links connect each header entry to its occurrences in the tree]

Conditional pattern base:
item   cond. pattern base
f      {}
c      f:3, {}
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
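The conditional pattern bases in the table can be reproduced without walking the tree. The sketch below is a shortcut, not the node-link traversal itself: since every FP-tree path is a merged set of ordered transactions, collecting the prefix that precedes an item in each ordered transaction gives the same prefixes and counts. The function name is chosen here for illustration.

```python
from collections import Counter

# Ordered frequent items per transaction (one character per item).
ordered_db = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_bases(ordered_transactions):
    """Map each item to a Counter {prefix path: count}."""
    bases = {}
    for t in ordered_transactions:
        for pos, item in enumerate(t):
            bases.setdefault(item, Counter())[t[:pos]] += 1
    return bases

bases = conditional_pattern_bases(ordered_db)
for item in "fcabmp":
    print(item, dict(bases[item]))   # the empty string stands for the empty prefix {}
```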
Properties of FP-tree for Conditional Pattern Bases
• Node-link property
– For any frequent item $a_i$, all the possible frequent patterns that contain $a_i$ can be obtained by following $a_i$'s node-links, starting from $a_i$'s head in the FP-tree header
• Prefix path property
– To calculate the frequent patterns for a node $a_i$ in a path P, only the prefix sub-path of $a_i$ in P needs to be accumulated, and its frequency count should carry the same count as node $a_i$.
Major Steps to Mine FP-tree: Conditional FP-tree
1) Construct conditional pattern base for each node in the FP-tree ✓
2) Construct conditional FP-tree from each conditional pattern-base
– The prefix paths of a suffix represent the conditional basis.
⇒ They can be regarded as transactions of a database.
– Those prefix paths whose support ≥ minSup induce a conditional FP-tree
– For each pattern-base
• Accumulate the count for each item in the base
• Construct the FP-tree for the frequent items of the pattern base

Conditional pattern base:
item   cond. pattern base
f      {}
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

Example: item frequencies in the conditional pattern base of m
item   frequency
f      3
c      3
a      3
b      1 ✗

Conditional FP-trees:
{}|f = {}
{}|c: f:3
{}|a: f:3 – c:3
{}|b = {}
{}|m: f:3 – c:3 – a:3
{}|p: c:3
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree ✓
2) Construct conditional FP-tree from each conditional pattern-base ✓
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
– If the conditional FP-tree contains a single path, simply enumerate all the patterns (enumerate all combinations of sub-paths)
Example: the m-conditional FP-tree {}|m: f:3 – c:3 – a:3 is just a single path.
All frequent patterns concerning m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
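Enumerating the patterns of a single-path conditional FP-tree amounts to listing all combinations of the nodes on the path and appending the suffix item. A minimal sketch, with hypothetical names, that also reports each pattern's count as the smallest count among the chosen nodes and the suffix:

```python
from itertools import combinations

def patterns_from_single_path(path, suffix, suffix_count):
    """path: [(item, count), ...] from the root downwards; returns {pattern: count}."""
    patterns = {}
    for r in range(len(path) + 1):
        for combo in combinations(path, r):
            name = "".join(item for item, _ in combo) + suffix
            patterns[name] = min([count for _, count in combo] + [suffix_count])
    return patterns

m_tree_path = [("f", 3), ("c", 3), ("a", 3)]     # the m-conditional FP-tree
patterns = patterns_from_single_path(m_tree_path, "m", 3)
print(list(patterns))
```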
FP-tree: Full Example
Database (minSup = 0.4):
TID   items bought    (ordered) frequent items
100   {b, c, f}       {f, b, c}
200   {a, b, c}       {b, c}
300   {d, f}          {f}
400   {b, c, e, f}    {f, b, c}
500   {f, g}          {f}

Header table:
item   frequency
f      4
b      3
c      3

FP-tree:
{}
├─ f:4
│  └─ b:2
│     └─ c:2
└─ b:1
   └─ c:1

Conditional pattern base 1:
item   cond. pattern base
f      {}
b      f:2, {}
c      fb:2, b:1

Conditional FP-trees:
{}|f = {}
{}|b: f:2
{}|c: two branches, f:2 – b:2 and b:1

Conditional pattern base 2 (within {}|c):
item   cond. pattern base
b      f:2
f      {}

Conditional FP-trees:
{}|fc = {}
{}|bc: f:2

Frequent itemsets obtained:
from {}|f: {f}
from {}|b: {b}, {fb}
from {}|fc: {fc}
from {}|bc: {bc}, {fbc}
Principles of Frequent Pattern Growth
• Pattern growth property
– Let $\alpha$ be a frequent itemset in DB, B be $\alpha$'s conditional pattern base, and $\beta$ be an itemset in B. Then $\alpha \cup \beta$ is a frequent itemset in DB iff $\beta$ is frequent in B.
• "abcdef" is a frequent pattern, if and only if
– "abcde" is a frequent pattern, and
– "f" is frequent in the set of transactions containing "abcde"
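The pattern growth property leads directly to a recursive miner. The sketch below is a simplification: it works on plain lists of (ordered prefix, count) pairs rather than on a compressed FP-tree, so it shows the recursion but not the memory savings. Names such as `pattern_growth` are chosen here. Besides the patterns grown from the conditional pattern bases, it also reports each single frequent item.

```python
from collections import Counter

def pattern_growth(cond_db, min_count, suffix=()):
    """cond_db: list of (ordered item tuple, count). Returns {pattern tuple: count}."""
    counts = Counter()
    for items, n in cond_db:
        for item in items:
            counts[item] += n
    result = {}
    for item, n in counts.items():
        if n < min_count:
            continue
        pattern = (item,) + suffix               # grow the current suffix by one frequent item
        result[pattern] = n
        # Conditional pattern base of `item`: the prefix before it in each entry.
        base = [(items[:items.index(item)], c) for items, c in cond_db if item in items]
        result.update(pattern_growth(base, min_count, pattern))
    return result

# Ordered frequent items of a five-transaction database, minSup = 0.4 (count >= 2).
db = [("f", "b", "c"), ("b", "c"), ("f",), ("f", "b", "c"), ("f",)]
frequent = pattern_growth([(t, 1) for t in db], min_count=2)
for pattern, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print("".join(pattern), n)
```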
Why Is Frequent Pattern Growth Fast?
• Performance study in [Han, Pei & Yin '00] shows
– FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
• Reasoning
– No candidate generation, no candidate test
• Apriori algorithm has to proceed breadth-first
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building
[Figure: run time (sec.) versus support threshold (%), comparing D1 FP-growth runtime and D1 Apriori runtime. Data set T25I20D10K: T 25 = avg. length of transactions, I 20 = avg. length of frequent itemsets, D 10K = database size (#transactions)]
Maximal or Closed Frequent Itemsets
• Big challenge: database contains potentially a huge number of frequent itemsets (especially if minSup is set too low).
– A frequent itemset of length 100 contains $2^{100} - 1$ many frequent subsets
• Closed frequent itemset:
An itemset X is closed in a data set D if there exists no proper super-itemset Y such that $support(X) = support(Y)$ in D.
– The set of closed frequent itemsets contains complete information regarding its corresponding frequent itemsets.
• Maximal frequent itemset:
An itemset X is maximal in a data set D if there exists no proper super-itemset Y such that $support(Y) \ge minSup$ in D.
– The set of maximal itemsets does not contain the complete support information
– More compact representation
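Both definitions can be checked mechanically once all frequent itemsets and their supports are known. A minimal sketch on a small set of frequent itemsets (assuming an absolute minSup of 3 transactions); the function name is chosen here. It only compares against frequent supersets, which suffices because a superset with the same support as a frequent itemset is itself frequent.

```python
def closed_and_maximal(frequent):
    """frequent: {frozenset: support} holding all frequent itemsets."""
    closed, maximal = set(), set()
    for x, sup in frequent.items():
        supersets = [y for y in frequent if x < y]     # proper frequent supersets
        if not supersets:
            maximal.add(x)
        if all(frequent[y] != sup for y in supersets):
            closed.add(x)
    return closed, maximal

frequent = {
    frozenset({"butter"}): 4, frozenset({"milk"}): 4, frozenset({"butter", "milk"}): 4,
    frozenset({"sugar"}): 3, frozenset({"butter", "sugar"}): 3,
    frozenset({"milk", "sugar"}): 3, frozenset({"butter", "milk", "sugar"}): 3,
}
closed, maximal = closed_and_maximal(frequent)
print(sorted(sorted(s) for s in closed))
print(sorted(sorted(s) for s in maximal))
```

In this example the two closed itemsets are enough to recover the support of all seven frequent itemsets, while the single maximal itemset only tells which itemsets are frequent.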
Simple Association Rules: Introduction
• Transaction database:
D = {{butter, bread, milk, sugar};
{butter, flour, milk, sugar};
{butter, eggs, milk, salt};
{eggs};
{butter, flour, milk, salt, sugar}}
• Frequent itemsets:
items                   support
{butter}                4
{milk}                  4
{butter, milk}          4
{sugar}                 3
{butter, sugar}         3
{milk, sugar}           3
{butter, milk, sugar}   3
• Question of interest:
– If milk and sugar are bought, will the customer always buy butter as well?
{milk, sugar} ⇒ {butter}?
– In this case, what would be the probability of buying butter?
Simple Association Rules: Basic Notions
• Items $I = \{i_1, i_2, \dots, i_m\}$: a set of literals (denoting items)
• Itemset $X$: set of items $X \subseteq I$
• Database $D$: set of transactions $T$, each transaction is a set of items $T \subseteq I$
• Transaction $T$ contains an itemset $X$: $X \subseteq T$
• The items in transactions and itemsets are sorted lexicographically:
– itemset $X = (x_1, x_2, \dots, x_k)$, where $x_1 \le x_2 \le \dots \le x_k$
• Length of an itemset: cardinality of the itemset (k-itemset: itemset of length k)
• The support of an itemset X is defined as: $support(X) = |\{T \in D \mid X \subseteq T\}|$
• Frequent itemset: an itemset X is called frequent iff $support(X) \ge minSup$
• Association rule: An association rule is an implication of the form $X \Rightarrow Y$ where $X, Y \subseteq I$ are two itemsets with $X \cap Y = \emptyset$.
• Note: simply enumerating all possible association rules is not reasonable!
⇒ What are the interesting association rules w.r.t. $D$?
Interestingness of Association Rules
• Interestingness of an association rule:
Quantify the interestingness of an association rule with respect to a transaction database D:
– Support: frequency (probability) of the entire rule with respect to D
$support(X \Rightarrow Y) = P(X \cup Y) = \frac{|\{T \in D \mid X \cup Y \subseteq T\}|}{|D|} = support(X \cup Y)$
"probability that a transaction in $D$ contains the itemset $X \cup Y$"
– Confidence: indicates the strength of implication in the rule
$confidence(X \Rightarrow Y) = P(Y \mid X) = \frac{|\{T \in D \mid X \cup Y \subseteq T\}|}{|\{T \in D \mid X \subseteq T\}|} = \frac{support(X \cup Y)}{support(X)}$
"conditional probability that a transaction in $D$ containing the itemset $X$ also contains itemset $Y$"
– Rule form: "Body ⇒ Head [support, confidence]"
• Association rule examples:
– buys diapers ⇒ buys beers [0.5%, 60%]
– major in CS ∧ takes DB ⇒ avg. grade A [1%, 75%]
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
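Both measures follow directly from counting transactions. A minimal sketch on a five-transaction basket database, evaluating the rule {milk, sugar} ⇒ {butter}; the function names are chosen here for illustration.

```python
D = [
    {"butter", "bread", "milk", "sugar"},
    {"butter", "flour", "milk", "sugar"},
    {"butter", "eggs", "milk", "salt"},
    {"eggs"},
    {"butter", "flour", "milk", "salt", "sugar"},
]

def support(itemset, db):
    """Relative support: fraction of transactions that contain the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(body, head, db):
    """Conditional probability that a transaction containing body also contains head."""
    return support(body | head, db) / support(body, db)

body, head = {"milk", "sugar"}, {"butter"}
print(support(body | head, D))     # 3 of 5 transactions
print(confidence(body, head, D))   # every transaction with milk and sugar also has butter
```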
Mining of Association Rules
• Task of mining association rules:
Given a database $D$, determine all association rules having a $support \ge minSup$ and a $confidence \ge minConf$ (so-called strong association rules).
• Key steps of mining association rules:
1) Find frequent itemsets, i.e., itemsets that have at least support = minSup
2) Use the frequent itemsets to generate association rules
• For each itemset $X$ and every nonempty subset $Y \subset X$ generate rule $Y \Rightarrow (X - Y)$ if minSup and minConf are fulfilled
• we have $2^{|X|} - 2$ many association rule candidates for each itemset $X$
• Example
frequent itemsets:
1-itemset   count    2-itemset   count    3-itemset    count
{A}         3        {A, B}      3        {A, B, C}    2
{B}         4        {A, C}      2
{C}         5        {B, C}      4
rule candidates: A ⇒ B; B ⇒ A; A ⇒ C; C ⇒ A; B ⇒ C; C ⇒ B;
A, B ⇒ C; A, C ⇒ B; C, B ⇒ A; A ⇒ B, C; B ⇒ A, C; C ⇒ A, B
Generating Rules from Frequent Itemsets
• For each frequent itemset X
– For each nonempty proper subset Y of X, form a rule $Y \Rightarrow (X - Y)$
– Delete those rules that do not have minimum confidence
Note: 1) the support is always at least minSup
2) the support values of the frequent itemsets suffice to calculate the confidence
• Example: X = {A, B, C}, minConf = 60%
itemset counts: {A} 3, {B} 4, {C} 5, {A, B} 3, {A, C} 2, {B, C} 4, {A, B, C} 2
– conf(A ⇒ B) = 3/3 ✓
– conf(B ⇒ A) = 3/4 ✓
– conf(A ⇒ C) = 2/3 ✓
– conf(C ⇒ A) = 2/5 ✗
– conf(B ⇒ C) = 4/4 ✓
– conf(C ⇒ B) = 4/5 ✓
– conf(A ⇒ B, C) = 2/3 ✓
– conf(B, C ⇒ A) = 1/2 ✗
– conf(B ⇒ A, C) = 2/4 ✗
– conf(A, C ⇒ B) = 1 ✓
– conf(C ⇒ A, B) = 2/5 ✗
– conf(A, B ⇒ C) = 2/3 ✓
• Exploit anti-monotonicity for generating candidates for strong association rules!
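Rule generation can be sketched as a loop over all frequent itemsets and their nonempty proper subsets, using only the stored counts. This straightforward version (the name `strong_rules` is chosen here) tests every candidate and does not yet exploit anti-monotonicity.

```python
from itertools import combinations

counts = {
    frozenset("A"): 3, frozenset("B"): 4, frozenset("C"): 5,
    frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 4,
    frozenset("ABC"): 2,
}

def strong_rules(counts, min_conf):
    """Return (body, head, confidence) for every rule reaching min_conf."""
    rules = []
    for x, count_x in counts.items():
        for r in range(1, len(x)):                   # nonempty proper subsets only
            for body in combinations(sorted(x), r):
                conf = count_x / counts[frozenset(body)]
                if conf >= min_conf:
                    head = tuple(sorted(x - frozenset(body)))
                    rules.append((body, head, conf))
    return rules

rules = strong_rules(counts, min_conf=0.6)
for body, head, conf in rules:
    print(", ".join(body), "=>", ", ".join(head), f"{conf:.2f}")
```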
Interestingness Measurements
• Objective measures
– Two popular measurements:
– support and
– confidence
• Subjective measures [Silberschatz & Tuzhilin, KDD95]
– A rule (pattern) is interesting if it is
– unexpected (surprising to the user) and/or
– actionable (the user can do something with it)

Criticism to Support and Confidence
Example 1 [Aggarwal & Yu, PODS98]
• Among 5000 students
– 3000 play basketball (=60%)
– 3750 eat cereal (=75%)
– 2000 both play basketball and eat cereal (=40%)
• Rule play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75% which is higher than 66.7%
• Rule play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence
• Observation: play basketball and eat cereal are negatively correlated
⇒ Not all strong association rules are interesting and some can be misleading.
⇒ Augment the support and confidence values with interestingness measures such as the correlation: A ⇒ B [supp, conf, corr]
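As a worked check of the observation, the correlation (lift) of both rules can be computed from the percentages given in the example, taking lift as the joint probability divided by the product of the individual probabilities; $P(A \cup B)$ follows the notation used for support and denotes the probability that both occur.

```latex
\[
corr_{\text{basketball},\,\text{cereal}}
  = \frac{P(\text{basketball} \cup \text{cereal})}{P(\text{basketball})\,P(\text{cereal})}
  = \frac{0.40}{0.60 \cdot 0.75} \approx 0.89 < 1
\]
\[
corr_{\text{basketball},\,\text{not cereal}}
  = \frac{P(\text{basketball} \cup \text{not cereal})}{P(\text{basketball})\,P(\text{not cereal})}
  = \frac{0.20}{0.60 \cdot 0.25} \approx 1.33 > 1
\]
```

The first value is below 1 (negative correlation) and the second above 1 (positive correlation), which matches the claim that the rule with lower support and confidence is the more accurate one.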
Other Interestingness Measures: Correlation
• Lift is a simple correlation measure between two items A and B:
$corr_{A,B} = \frac{P(A \cup B)}{P(A)\,P(B)} = \frac{P(B \mid A)}{P(B)} = \frac{conf(A \Rightarrow B)}{supp(B)}$
! The two rules $A \Rightarrow B$ and $B \Rightarrow A$ have the same correlation coefficient.
• takes both P(A) and P(B) into consideration
• $corr_{A,B} > 1$: the two items A and B are positively correlated
• $corr_{A,B} = 1$: there is no correlation between the two items A and B
• $corr_{A,B} < 1$: the two items A and B are negatively correlated
Other Interestingness Measures: Correlation
• Example 2:
X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

rule     support   confidence   correlation
X ⇒ Y    25%       50%          2
X ⇒ Z    37.5%     75%          0.86
Y ⇒ Z    12.5%     50%          0.57

• X and Y: positively correlated
• X and Z: negatively correlated
• The support and confidence of X ⇒ Z dominate, although the items X and Z are negatively correlated
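The support, confidence, and correlation values for three binary item vectors X, Y, and Z over eight transactions can be recomputed with a short sketch; the function name `measures` is chosen here for illustration.

```python
# One entry per transaction: 1 if the item occurs in it, 0 otherwise.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def measures(a, b):
    """Return (support, confidence, lift) of the rule a => b."""
    n = len(a)
    p_a, p_b = sum(a) / n, sum(b) / n
    p_ab = sum(1 for u, v in zip(a, b) if u and v) / n
    return p_ab, p_ab / p_a, p_ab / (p_a * p_b)

for name, (a, b) in {"X => Y": (X, Y), "X => Z": (X, Z), "Y => Z": (Y, Z)}.items():
    sup, conf, lift = measures(a, b)
    print(f"{name}: support={sup:.1%} confidence={conf:.0%} lift={lift:.2f}")
```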
Hierarchical Association Rules: Motivation
• Problem of association rules in plain itemsets
– High minsup: apriori finds only few rules
– Low minsup: apriori finds unmanageably many rules
• Exploit item taxonomies (generalizations, is-a hierarchies) which exist in many applications
[Figure: item taxonomy. clothes → outerwear, shirts; outerwear → jackets, jeans; shoes → sports shoes, boots]
• New task: find all generalized association rules between generalized items ⇒ Body and Head of a rule may have items of any level of the hierarchy
• Generalized association rule: $X \Rightarrow Y$
with $X, Y \subset I$, $X \cap Y = \emptyset$ and no item in $Y$ is an ancestor of any item in $X$
i.e., jackets ⇒ clothes is essentially true
Hierarchical Association Rules: Motivating Example
• Examples
jeans ⇒ boots        (support < minSup)
jackets ⇒ boots      (support < minSup)
outerwear ⇒ boots    (support > minSup)
• Characteristics
– Support("outerwear ⇒ boots") is not necessarily equal to the sum support("jackets ⇒ boots") + support("jeans ⇒ boots"), e.g., if a transaction with jackets, jeans and boots exists
– Support for sets of generalizations (e.g., product groups) is higher than support for sets of individual items
If the support of rule "outerwear ⇒ boots" exceeds minsup, then the support of rule "clothes ⇒ boots" does, too
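The first characteristic can be illustrated by extending every transaction with the ancestors of its items and counting support on the extended transactions. In the sketch below the five transactions are hypothetical, and the taxonomy is a plain child-to-parent dictionary chosen for illustration.

```python
# Taxonomy as child -> parent.
parent = {
    "jackets": "outerwear", "jeans": "outerwear",
    "outerwear": "clothes", "shirts": "clothes",
    "sports shoes": "shoes", "boots": "shoes",
}

def extend(transaction):
    """Add all ancestors of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            extended.add(item)
    return extended

def support_count(itemset, db):
    return sum(1 for t in db if itemset <= t)

# Hypothetical transactions; the first one contains jackets, jeans, and boots together.
D = [
    {"jackets", "jeans", "boots"},
    {"jackets", "boots"},
    {"jeans", "shirts"},
    {"boots"},
    {"shirts", "sports shoes"},
]
extended_D = [extend(t) for t in D]

jackets = support_count({"jackets", "boots"}, extended_D)
jeans = support_count({"jeans", "boots"}, extended_D)
outerwear = support_count({"outerwear", "boots"}, extended_D)
clothes = support_count({"clothes", "boots"}, extended_D)
print(jackets, jeans, outerwear, clothes)   # 2 1 2 2
```

The first transaction counts once for outerwear and boots but once each for the two specialized rules, so the generalized support (2) is smaller than the sum (3), while the support of the more general itemset is never below that of the more specific one.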
Mining Multi-Level Associations
• A top-down, progressive deepening approach:
– First find high-level strong rules:
• milk ⇒ bread [20%, 60%].
– Then find their lower-level "weaker" rules:
• 1.5% milk ⇒ wheat bread [6%, 50%].
• Different min_support thresholds across multi-levels lead to different algorithms:
– adopting the same min_support across multi-levels
– adopting reduced min_support at lower levels
[Figure: item hierarchy. Food → milk, bread; milk → 3.5%, 1.5%; bread → white, wheat; the brand names Sunset, Fraser, and Wonder appear at the lowest level]
Minimum Support for Multiple Levels
• Uniform Support
+ the search procedure is simplified (monotonicity)
+ the user is required to specify only one support threshold
[Figure: level 1 (minsup = 5%): milk, support = 10%; level 2 (minsup = 5%): 3.5% milk, support = 6%, and 1.5% milk, support = 4%]
• Reduced Support (Variable Support)
+ takes the lower frequency of items in lower levels into consideration
[Figure: level 1 (minsup = 5%): milk, support = 10%; level 2 (minsup = 3%): 3.5% milk, support = 6%, and 1.5% milk, support = 4%]
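The difference between the two strategies shows up when the same item supports are filtered with a per-level threshold. A minimal sketch using the support values from the figure; the data layout and function name are chosen here for illustration.

```python
# item -> (hierarchy level, support)
supports = {
    "milk": (1, 0.10),
    "3.5% milk": (2, 0.06),
    "1.5% milk": (2, 0.04),
}

def frequent_items(supports, minsup_per_level):
    """Keep the items whose support reaches the threshold of their level."""
    return [item for item, (level, s) in supports.items() if s >= minsup_per_level[level]]

uniform = frequent_items(supports, {1: 0.05, 2: 0.05})   # same threshold on both levels
reduced = frequent_items(supports, {1: 0.05, 2: 0.03})   # lower threshold on level 2
print(uniform)   # ['milk', '3.5% milk']
print(reduced)   # ['milk', '3.5% milk', '1.5% milk']
```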
Multilevel Association Mining using Reduced Support
• A top-down, progressive deepening approach:
– First find high-level strong rules:
• milk ⇒ bread [20%, 60%].
– Then find their lower-level "weaker" rules:
• 1.5% milk ⇒ wheat bread [6%, 50%].
3 approaches using reduced support (level-wise processing, breadth first):
• Level-by-level independent method:
– Examine each node in the hierarchy, regardless of whether or not its parent node is found to be frequent
• Level-cross-filtering by single item:
– Examine a node only if its parent node at the preceding level is frequent
• Level-cross-filtering by k-itemset:
– Examine a k-itemset at a given level only if its parent k-itemset at the preceding level is frequent
[Figure: item hierarchy. Food → milk, bread; milk → 3.5%, 1.5%; bread → white, wheat; the brand names Sunset, Fraser, and Wonder appear at the lowest level]
Multilevel Associations: Variants
• A top-down, progressive deepening approach:
  – First find high-level strong rules:
    • milk ⇒ bread [20%, 60%].
  – Then find their lower-level "weaker" rules:
    • 1.5% milk ⇒ wheat bread [6%, 50%].
• Variations of mining multiple-level association rules:
  – Level-crossed association rules:
    • 1.5% milk ⇒ Wonder wheat bread
  – Association rules with multiple, alternative hierarchies:
    • 1.5% milk ⇒ Wonder bread
Multi-level Association: Redundancy Filtering
• Some rules may be redundant due to "ancestor" relationships between items.
• Example
  – R1: milk ⇒ wheat bread [support = 8%, confidence = 70%]
  – R2: 1.5% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say that rule 1 is an ancestor of rule 2.
• Redundancy: A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
Interestingness of Hierarchical Association Rules: Notions
Let $X, X', Y, Y' \subseteq I$ be itemsets.
• An itemset $X'$ is an ancestor of $X$ iff there exist ancestors $x'_1, \dots, x'_k$ of $x_1, \dots, x_k \in X$ and $x_{k+1}, \dots, x_n$ with $n = |X|$ such that $X' = \{x'_1, \dots, x'_k, x_{k+1}, \dots, x_n\}$.
• Let $X'$ and $Y'$ be ancestors of $X$ and $Y$. Then we call the rules $X' \Rightarrow Y'$, $X \Rightarrow Y'$, and $X' \Rightarrow Y$ ancestors of the rule $X \Rightarrow Y$.
• The rule $X' \Rightarrow Y'$ is a direct ancestor of rule $X \Rightarrow Y$ in a set of rules if:
  – Rule $X' \Rightarrow Y'$ is an ancestor of rule $X \Rightarrow Y$, and
  – There is no rule $X'' \Rightarrow Y''$ such that $X'' \Rightarrow Y''$ is an ancestor of $X \Rightarrow Y$ and $X' \Rightarrow Y'$ is an ancestor of $X'' \Rightarrow Y''$.
• A hierarchical association rule $X \Rightarrow Y$ is called R-interesting if:
  – There are no direct ancestors of $X \Rightarrow Y$, or
  – The actual support is larger than R times the expected support, or
  – The actual confidence is larger than R times the expected confidence.
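The ancestor notions can be checked mechanically. The following is a minimal sketch, assuming a taxonomy stored as a child-to-parent map; the taxonomy `PARENT` and all function names are hypothetical, and the pairing of replaced items with ancestors is done by brute force, which is only reasonable for small itemsets.

```python
from itertools import permutations

# Hypothetical taxonomy (child -> parent); not taken from the slides.
PARENT = {"jackets": "outerwear", "outerwear": "clothes", "boots": "shoes"}

def ancestors(item):
    """Return all proper ancestors of an item."""
    found = set()
    while item in PARENT:
        item = PARENT[item]
        found.add(item)
    return found

def is_ancestor_itemset(x_anc, x):
    """True if x_anc results from x by replacing at least one item with one
    of its ancestors while keeping the number of items unchanged."""
    x_anc, x = set(x_anc), set(x)
    if len(x_anc) != len(x) or x_anc == x:
        return False
    replaced = sorted(x - x_anc)       # items of x that were generalised
    generalised = sorted(x_anc - x)    # the ancestors that took their place
    return any(
        all(g in ancestors(r) for r, g in zip(replaced, perm))
        for perm in permutations(generalised)
    )

def is_ancestor_rule(rule_anc, rule):
    """Rules are (antecedent, consequent) pairs of item sets."""
    (xa, ya), (x, y) = rule_anc, rule
    same_x, same_y = set(xa) == set(x), set(ya) == set(y)
    x_ok = same_x or is_ancestor_itemset(xa, x)
    y_ok = same_y or is_ancestor_itemset(ya, y)
    return x_ok and y_ok and not (same_x and same_y)

print(is_ancestor_itemset({"clothes", "shoes"}, {"jackets", "shoes"}))
print(is_ancestor_rule(({"clothes"}, {"shoes"}), ({"jackets"}, {"shoes"})))
```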
Expected Support and Expected Confidence
• How to compute the expected support?
Given the rule $X \Rightarrow Y$ and its ancestor rule $X' \Rightarrow Y'$, the expected support of $X \Rightarrow Y$ is defined as:
$$E_{Z'}[P(Z)] = \frac{P(z_1)}{P(z'_1)} \times \cdots \times \frac{P(z_k)}{P(z'_k)} \times P(Z')$$
where $Z = X \cup Y = \{z_1, \dots, z_n\}$, $Z' = X' \cup Y' = \{z'_1, \dots, z'_k, z_{k+1}, \dots, z_n\}$ and each $z'_i \in Z'$ is an ancestor of $z_i \in Z$.
[SA'95] R. Srikant, R. Agrawal: Mining Generalized Association Rules. In VLDB, 1995.
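As a rough illustration, the expected-support formula translates directly into a product of support ratios. The function name, the argument layout and all numbers below are assumptions made for this sketch; only the formula itself comes from the definition above.

```python
from math import prod

def expected_support(item_support, replaced, ancestor_rule_support):
    """Expected support of a rule, given its ancestor rule.

    replaced maps each specialised item z_i to the ancestor z'_i that
    stands in for it in the ancestor rule; items that are identical in
    both rules are simply left out of the mapping.
    """
    ratio = prod(item_support[z] / item_support[z_anc]
                 for z, z_anc in replaced.items())
    return ratio * ancestor_rule_support

# Hypothetical supports (fractions of all transactions).
item_support = {"clothes": 0.20, "outerwear": 0.10,
                "shoes": 0.30, "boots": 0.15}

# Ancestor rule: clothes => shoes with support 0.10.
# Specialised rule: outerwear => boots (both items replaced).
exp = expected_support(
    item_support,
    {"outerwear": "clothes", "boots": "shoes"},
    0.10,
)
print(round(exp, 4))
```

Here both ratios are 1/2, so the expected support is a quarter of the ancestor's support.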
Expected Support and Expected Confidence
• How to compute the expected confidence?
Given the rule $X \Rightarrow Y$ and its ancestor rule $X' \Rightarrow Y'$, the expected confidence of $X \Rightarrow Y$ is defined as:
$$E_{X' \Rightarrow Y'}[P(Y \mid X)] = \frac{P(y_1)}{P(y'_1)} \times \cdots \times \frac{P(y_k)}{P(y'_k)} \times P(Y' \mid X')$$
where $Y = \{y_1, \dots, y_n\}$ and $Y' = \{y'_1, \dots, y'_k, y_{k+1}, \dots, y_n\}$ and each $y'_i \in Y'$ is an ancestor of $y_i \in Y$.
[SA'95] R. Srikant, R. Agrawal: Mining Generalized Association Rules. In VLDB, 1995.
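The expected confidence can be sketched in the same way. Note that only items of the consequent enter the product, so a rule that specialises only the antecedent is expected to keep the confidence of its ancestor. The function name and the bread supports are hypothetical; the 60% and 70% confidences are borrowed from the milk examples purely as input values.

```python
from math import prod

def expected_confidence(item_support, replaced_in_y, ancestor_confidence):
    """Expected confidence of a rule, given its ancestor rule.

    replaced_in_y maps each specialised consequent item y_i to its
    ancestor y'_i in the consequent of the ancestor rule.
    """
    ratio = prod(item_support[y] / item_support[y_anc]
                 for y, y_anc in replaced_in_y.items())
    return ratio * ancestor_confidence

# Hypothetical item supports.
item_support = {"bread": 0.40, "wheat bread": 0.10}

# Ancestor: milk => bread with confidence 0.60.
# Specialised consequent: milk => wheat bread.
conf_y = expected_confidence(item_support, {"wheat bread": "bread"}, 0.60)

# Specialised antecedent only (1.5% milk => wheat bread vs.
# milk => wheat bread): the product is empty, so nothing changes.
conf_x = expected_confidence(item_support, {}, 0.70)

print(round(conf_y, 4), conf_x)
```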
Interestingness of Hierarchical Association Rules: Example
• Example
  – Let R = 1.6

Item       Support
clothes    20
outerwear  10
jackets    4

No  Rule               Support  R-interesting?
1   clothes ⇒ shoes    10       yes: no ancestors
2   outerwear ⇒ shoes  9        yes: support > R × exp. support (wrt. rule 1) = 1.6 × (10/20 × 10) = 8
3   jackets ⇒ shoes    4        not wrt. support: support > R × exp. support (wrt. rule 1) = 3.2, but support < R × exp. support (wrt. rule 2) = 5.76 ⇒ still need to check the confidence!
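The thresholds in the clothes example can be recomputed with a short script. This is a sketch that uses the item and rule supports from the example table; the helper name `threshold` and the data layout are assumptions. For rule 3 with respect to rule 2 the arithmetic is 1.6 × (4/10 × 9) = 1.6 × 3.6 = 5.76.

```python
R = 1.6
item_support = {"clothes": 20, "outerwear": 10, "jackets": 4}
# Support of the rule "<item> => shoes" for each antecedent item.
rule_support = {"clothes": 10, "outerwear": 9, "jackets": 4}

def threshold(child, ancestor):
    """R times the expected support of `child => shoes`,
    computed with respect to the rule `ancestor => shoes`."""
    expected = (item_support[child] / item_support[ancestor]
                * rule_support[ancestor])
    return R * expected

pairs = [("outerwear", "clothes"),
         ("jackets", "clothes"),
         ("jackets", "outerwear")]
for child, ancestor in pairs:
    t = threshold(child, ancestor)
    verdict = "exceeds" if rule_support[child] > t else "does not exceed"
    print(f"{child} => shoes: support {rule_support[child]} "
          f"{verdict} {t:.2f} (wrt. {ancestor} => shoes)")
```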
Chapter 3: Frequent Itemset Mining
1) Introduction
  – Transaction databases, market basket data analysis
2) Simple Association Rules
  – Basic notions, rule generation, interestingness measures
3) Mining Frequent Itemsets
  – Apriori algorithm, hash trees, FP-tree
4) Further Topics
  – Hierarchical Association Rules
    • Motivation, notions, algorithms, interestingness
  – Multidimensional and Quantitative Association Rules
    • Motivation, basic idea, partitioning numerical attributes, adaptation of Apriori algorithm, interestingness
5) Summary
Multi-Dimensional Association: Concepts
• Single-dimensional rules:
  – buys milk ⇒ buys bread
• Multi-dimensional rules: ≥ 2 dimensions
  – Inter-dimension association rules (no repeated dimensions)
    • age between 19-25 ∧ status is student ⇒ buys coke
  – Hybrid-dimension association rules (repeated dimensions)
    • age between 19-25 ∧ buys popcorn ⇒ buys coke
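The three rule types differ only in how often each dimension (predicate) occurs in the rule. A small sketch, assuming rules are represented as lists of (dimension, value) pairs; this representation and the function name `classify_rule` are illustrative choices.

```python
from collections import Counter

def classify_rule(antecedent, consequent):
    """Classify a rule by the dimensions of its predicates.

    Each side is a list of (dimension, value) pairs.
    """
    dims = Counter(dim for dim, _ in antecedent + consequent)
    if len(dims) == 1:
        return "single-dimensional"
    if any(count > 1 for count in dims.values()):
        return "hybrid-dimension"   # some dimension is repeated
    return "inter-dimension"        # every dimension occurs once

print(classify_rule([("buys", "milk")], [("buys", "bread")]))
print(classify_rule([("age", "19-25"), ("status", "student")],
                    [("buys", "coke")]))
print(classify_rule([("age", "19-25"), ("buys", "popcorn")],
                    [("buys", "coke")]))
```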
Techniques for Mining Multi-Dimensional Associations
• Search for frequent k-predicate sets:
  – Example: {age, occupation, buys} is a 3-predicate set.
  – Techniques can be categorized by how quantitative attributes such as age are treated.
1. Using static discretization of quantitative attributes
  – Quantitative attributes are statically discretized by using predefined concept hierarchies.
2. Quantitative association rules
  – Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data.
3. Distance-based association rules
  – This is a dynamic discretization process that considers the distance between data points.
Quantitative Association Rules
• Up to now: associations of boolean attributes only
• Now: numerical attributes, too
• Example:
  – Original database

    ID  age  marital status  # cars
    1   23   single          0
    2   38   married         2

  – Boolean database

    ID  age: 20..29  age: 30..39  m-status: single  m-status: married  ...
    1   1            0            1                 0                  ...
    2   0            1            0                 1                  ...
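The mapping from the original database to the boolean database can be sketched as follows. The two records and the age ranges are the ones from the example; the field names and the helper `to_boolean` are assumptions, and the `# cars` attribute is left out because the example does not show how it is encoded.

```python
records = [
    {"id": 1, "age": 23, "marital_status": "single", "cars": 0},
    {"id": 2, "age": 38, "marital_status": "married", "cars": 2},
]
AGE_BINS = [(20, 29), (30, 39)]
STATUSES = ["single", "married"]

def to_boolean(record):
    """Turn one record into a row of 0/1 attributes."""
    row = {"id": record["id"]}
    for lo, hi in AGE_BINS:
        row[f"age: {lo}..{hi}"] = int(lo <= record["age"] <= hi)
    for status in STATUSES:
        row[f"m-status: {status}"] = int(record["marital_status"] == status)
    return row

boolean_db = [to_boolean(r) for r in records]
for row in boolean_db:
    print(row)
```

Once every attribute value or interval is a boolean item, the standard frequent itemset algorithms apply unchanged.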
Quantitative Association Rules: Ideas
• Static discretization
  – Discretization of all attributes before mining the association rules
  – E.g. by using a generalization hierarchy for each attribute
  – Substitute numerical attribute values by ranges or intervals
• Dynamic discretization
  – Discretization of the attributes during association rule mining
  – Goal (e.g.): maximization of confidence
  – Unification of neighboring association rules to a generalized rule
Partitioning of Numerical Attributes
• Problem: Minimum support
  – Too many intervals → too small support for each individual interval
  – Too few intervals → too small confidence of the rules
• Solution
  – First, partition the domain into many intervals
  – Afterwards, create new intervals by merging adjacent intervals
• Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.
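The partition-then-merge idea can be illustrated with a simple greedy pass. This is only a sketch under strong assumptions: it merges neighbours until each merged interval reaches a minimum support, whereas the approach described above optimizes confidence or compactness. The function name, the interval counts and the threshold are hypothetical.

```python
def merge_intervals(counts, total, minsup):
    """Greedily merge adjacent fine-grained intervals.

    counts: list of ((lo, hi), count), sorted by lo and adjacent.
    Neighbours are merged until the merged interval reaches minsup;
    a leftover tail is attached to the last merged interval.
    """
    merged = []
    cur_lo, cur_hi, cur_count = None, None, 0
    for (lo, hi), count in counts:
        if cur_lo is None:
            cur_lo = lo
        cur_hi = hi
        cur_count += count
        if cur_count / total >= minsup:
            merged.append(((cur_lo, cur_hi), cur_count))
            cur_lo, cur_count = None, 0
    if cur_lo is not None:  # tail that never reached minsup
        if merged:
            (lo, _), count = merged.pop()
            merged.append(((lo, cur_hi), count + cur_count))
        else:
            merged.append(((cur_lo, cur_hi), cur_count))
    return merged

# Hypothetical age histogram over 13 records.
fine = [((20, 24), 3), ((25, 29), 2), ((30, 34), 6),
        ((35, 39), 1), ((40, 44), 1)]
coarse = merge_intervals(fine, total=13, minsup=0.3)
print(coarse)
```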
Quantitative Association Rules
• 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
• Cluster "adjacent" association rules to form general rules using a 2-D grid.
• Example:
  age(X, "30-34") ∧ income(X, "24K - 48K") ⇒ buys(X, "high resolution TV")
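A heavily simplified sketch of clustering adjacent rules on a 2-D grid: each grid cell (age, income bin) stands for one rule that predicts the same category, and runs of consecutive ages within one income bin are merged into a single age range. Merging along one axis only is a simplification chosen for brevity; the cell data and the function name are hypothetical.

```python
def merge_adjacent_ages(cells):
    """Merge runs of consecutive ages within each income bin.

    cells: non-empty set of (age, income_bin) grid cells whose rules
    passed the support and confidence thresholds.
    Returns a list of (age_lo, age_hi, income_bin) ranges.
    """
    merged = []
    for income_bin in sorted({b for _, b in cells}):
        ages = sorted(a for a, b in cells if b == income_bin)
        start = prev = ages[0]
        for age in ages[1:]:
            if age != prev + 1:          # gap: close the current run
                merged.append((start, prev, income_bin))
                start = age
            prev = age
        merged.append((start, prev, income_bin))
    return merged

# Hypothetical cells that all predict buys(X, "high resolution TV").
cells = {(30, "24K-48K"), (31, "24K-48K"), (32, "24K-48K"),
         (33, "24K-48K"), (34, "24K-48K"), (40, "48K-72K")}
ranges = merge_adjacent_ages(cells)
print(ranges)
```

The first merged range corresponds to a general rule of the form age(X, "30-34") ∧ income(X, "24K - 48K") ⇒ buys(X, "high resolution TV").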
Chapter 3: Frequent Itemset Mining
1) Introduction
  – Transaction databases, market basket data analysis
2) Mining Frequent Itemsets
  – Apriori algorithm, hash trees, FP-tree
3) Simple Association Rules
  – Basic notions, rule generation, interestingness measures
4) Further Topics
  – Hierarchical Association Rules
    • Motivation, notions, algorithms, interestingness
  – Quantitative Association Rules
    • Motivation, basic idea, partitioning numerical attributes, adaptation of Apriori algorithm, interestingness
5) Summary