Summary: What is Data Mining?
 Data mining: semi-automatic discovery of interesting patterns from large data sets
 Knowledge discovery is a process:
◦ Preprocessing
◦ Data mining
◦ Postprocessing
Summary: Typical KDD Process
[Figure: the KDD process pipeline. Raw data from an operational database undergoes time-based selection and preprocessing (yielding cleaned, verified, focused input data), then data mining, then postprocessing (evaluation of interestingness), producing selected usable patterns for utilization.]
Association Rules
◦ Basics
◦ Apriori Algorithm
◦ FP Growth Algorithm
◦ Examples
◦ Applications
• A set of items together is called an itemset. An itemset with k items is called a k-itemset.
• An itemset may contain one or more items. An itemset that occurs frequently is called a frequent itemset. Thus frequent itemset mining is a data mining technique for identifying the items that often occur together.
• For example: bread and butter, laptop and antivirus software, etc.
• A set of items is called frequent if it satisfies a minimum threshold value for support; association rules derived from it must additionally satisfy a minimum confidence.
• Support measures how often the items are purchased together in a single transaction.
• Confidence measures how often the consequent is purchased in transactions that already contain the antecedent.
• In frequent itemset mining, we consider only those itemsets and rules that meet the minimum support and confidence requirements.
• Insights from these mining algorithms offer many benefits, such as cost reduction and improved competitive advantage.
 Frequent pattern mining (FPM) is one of the most important techniques in data mining for discovering relationships between different items in a dataset.
 FPM has many applications in data analysis, software bug detection, cross-marketing, sales campaign analysis, market basket analysis, etc.
 Frequent itemsets discovered through Apriori serve many data mining tasks, such as finding interesting patterns in the database and discovering sequences; mining association rules is the most important of them.
 Association rules apply to supermarket transaction data, that is, they examine customer behavior in terms of the purchased products. Association rules describe how often items are purchased together.
 This is the most typical example of association mining. Data is collected using barcode scanners in most supermarkets.
 This database, known as the “market basket” database, consists of a large number of records on past transactions.
 A single record lists all the items bought by a customer in one sale. Knowing which groups of customers are inclined towards which sets of items gives these shops the freedom to adjust the store layout and the store catalogue to place items optimally with respect to one another.
When people buy green tea, they may well also buy honey with it. This relationship is depicted as a conditional rule, as given below.
IF {green tea} THEN {honey}
The rule states that the item on the right is likely to be ordered together with the items on the left. Market basket analysis in data mining helps us understand such relationships and how helpful it would be to alter our decisions based on the analysis.
 Typical representation formats for association rules:
◦ Green Tea → Honey [0.5%, 60%]
◦ buys: Green Tea → buys: Honey [0.5%, 60%]
◦ "Green Tea and Honey are bought together in 0.5% of the rows in the database."
◦ "IF buys Green Tea, THEN buys Honey in 60% of the cases."
Green Tea → Honey [0.5%, 60%]
1. Antecedent, left-hand side (LHS), body: Green Tea
2. Consequent, right-hand side (RHS), head: Honey
3. Support, frequency ("in how big a part of the data the things on the left- and right-hand sides occur together"): 0.5%
4. Confidence, strength ("if the left-hand side occurs, how likely the right-hand side occurs"): 60%
Reading: "IF buys Green Tea, THEN buys Honey in 60% of the cases, in 0.5% of the rows."
For a rule A => B:
Support is a measure of how frequently an itemset (here A) appears in a dataset. It indicates the proportion of transactions or records in the dataset that contain the itemset.
Confidence denotes the probability that a transaction containing A also contains B.
Support(X) = (Number of transactions containing X) / (Total number of transactions)
Confidence(X => Y) = (Number of transactions containing X and Y) / (Number of transactions containing X)
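To make the two measures concrete, here is a minimal Python sketch that computes support and confidence over a toy transaction list (the transactions and item names are made up for illustration):

```python
# Toy transaction database; each transaction is a set of items.
transactions = [
    {"green tea", "honey"},
    {"green tea", "honey", "lemon"},
    {"green tea"},
    {"coffee"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    return support(lhs | rhs) / support(lhs)

print(support({"green tea", "honey"}))       # 2/4 = 0.5
print(confidence({"green tea"}, {"honey"}))  # 2/3 ≈ 0.67
```

For this toy database the rule would be written Green Tea → Honey [50%, 67%].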
 The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules.
 The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.
 The Apriori algorithm is used to calculate the association rules between objects.
 This algorithm uses two steps, “join” and “prune”, to reduce the search space. It is an iterative approach to discovering the most frequent itemsets.
 Association here means how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning method that analyses whether people who bought product A also bought product B.
 Method:
◦ Initially, scan the DB once to get the frequent 1-itemsets.
◦ Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
◦ Test the candidates against the minimum support threshold.
◦ Terminate when no frequent or candidate set can be generated.
• First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item, and collecting those items that
satisfy minimum support. The resulting set is denoted by L1.
• Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used
to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
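The level-wise join-prune-count loop can be sketched in a few lines of Python. This is a minimal illustration of the method described above, not an optimized implementation; the function and variable names are my own:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, mined level by level."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # Scan 1: count individual items and keep the frequent 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c / n >= min_support}
    result, k = dict(frequent), 2
    while frequent:
        # Join: form k-itemset candidates from the items seen in L(k-1).
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)]
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
        # Test: one full scan of the database per level, as noted above.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result
```

For example, apriori([{"Coke", "Chips"}, {"Coke", "Chips"}, {"Coke"}], 2/3) returns {Coke}, {Chips}, and {Coke, Chips} together with their support counts.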
Find the frequent itemsets and generate association rules for this dataset. Assume a minimum support threshold (s = 33.33%) and a minimum confidence threshold (c = 60%).
Let’s start.
There is only one itemset with the minimum support count of 2, so the only frequent itemset is
Frequent Itemset (I) = {Coke, Chips}
Advantages of the Apriori Algorithm
• It can be used to compute large itemsets.
• It is simple to understand and apply.
Disadvantages of the Apriori Algorithm
• Apriori is an expensive method for finding support, since the calculation has to pass through the whole database.
• Sometimes a huge number of candidate itemsets is needed, so the algorithm becomes computationally expensive.
 Multiple database scans are costly
 Mining long patterns needs many passes of scanning and generates
lots of candidates
◦ To find frequent itemset i1i2…i100
 # of scans: 100
 Bottleneck: candidate-generation-and-test
 Can we avoid candidate generation?
 The FP-Growth algorithm is an alternative way to find frequent itemsets without using candidate generation, thus improving performance.
 To do so, it uses a divide-and-conquer strategy. The core of this method is the use of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.
What is the FP Growth Algorithm?
The FP Growth algorithm in data mining is a popular method for frequent pattern mining. The algorithm is efficient for mining frequent itemsets in large datasets.
 It works by constructing a frequent pattern tree (FP-tree) from the input dataset.
The FP Growth algorithm was developed by Han et al. in 2000 and is a powerful tool for frequent pattern mining in data mining. It is widely used in various applications such as market basket analysis, bioinformatics, and web usage mining.
FP-Growth allows frequent itemset discovery without candidate itemset generation, using a two-step approach:
◦ Step 1: Build a compact data structure called the FP-tree.
 Built using 2 passes over the dataset.
◦ Step 2: Extract frequent itemsets directly from the FP-tree.
Working of the FP Growth Algorithm
The working of the FP Growth algorithm in data mining can be summarized in the
following steps:
Scan the database:
In this step, the algorithm scans the input dataset to determine the frequency of each
item. This determines the order in which items are added to the FP tree, with the most
frequent items added first.
Sort items:
In this step, the items in the dataset are sorted in descending order of frequency. The
infrequent items that do not meet the minimum support threshold are removed from the
dataset. This helps to reduce the dataset's size and improve the algorithm's efficiency.
Construct the FP-tree:
In this step, the FP-tree is constructed. The FP-tree is a compact data structure that stores the
frequent itemsets and their support counts.
Generate frequent itemsets:
Once the FP-tree has been constructed, frequent itemsets can be generated by recursively
mining the tree. Starting at the bottom of the tree, the algorithm finds all combinations of
frequent itemsets that satisfy the minimum support threshold.
Generate association rules:
Once all frequent itemsets have been generated, the algorithm post-processes them to
generate association rules, which can be used to identify interesting relationships between
the items in the dataset.
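As a usage illustration, the following sketch runs FP-Growth end to end with the third-party mlxtend library (an assumption, not something the slides mention: mlxtend and pandas are installed, and the API shown matches the commonly documented versions; the transactions are made up):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["green tea", "honey"],
    ["green tea", "honey", "lemon"],
    ["green tea"],
    ["coffee", "honey"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Steps 1-4: mlxtend builds the FP-tree internally and mines frequent itemsets.
frequent = fpgrowth(df, min_support=0.5, use_colnames=True)

# Step 5: post-process the frequent itemsets into association rules.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```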
Now, for each item, the Conditional Pattern Base is computed. It consists of the path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the conditional pattern base table are arranged in ascending order of their frequencies.
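A minimal sketch of this step, using made-up transactions over items Y, M, and O: each frequency-ordered transaction traces a path in the FP-tree, and the conditional pattern base of an item collects the prefix of every path that reaches that item. (In a real FP-tree, shared prefixes are merged and carry aggregated counts; here each toy path simply counts 1.)

```python
from collections import defaultdict

# Hypothetical toy transactions (illustrative only).
transactions = [["Y", "M", "O"], ["Y", "M"], ["Y", "O"], ["Y", "M"]]

# Count frequencies and order each transaction by descending frequency,
# as in the "sort items" step above.
freq = defaultdict(int)
for t in transactions:
    for item in t:
        freq[item] += 1
ordered = [sorted(t, key=lambda i: (-freq[i], i)) for t in transactions]

def conditional_pattern_base(item):
    """Prefix paths leading to `item`, each with its count (1 per toy path)."""
    return [(t[:t.index(item)], 1)
            for t in ordered if item in t and t.index(item) > 0]

print(conditional_pattern_base("O"))  # [(['Y', 'M'], 1), (['Y'], 1)]
```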
Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.
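Continuing the toy sketch above, the conditional FP-tree for item O keeps the items common to all of O's prefix paths, with support equal to the sum of the path counts (the data and names are illustrative, following the step exactly as described above):

```python
base = conditional_pattern_base("O")  # [(['Y', 'M'], 1), (['Y'], 1)]
# Items common to every prefix path, each with the summed path count.
common = set.intersection(*(set(p) for p, _ in base)) if base else set()
support_count = sum(c for _, c in base)
print({item: support_count for item in common})  # {'Y': 2}
```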
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item.
Advantages of FP Growth Algorithm
The FP Growth algorithm in data mining has several advantages over other frequent itemset mining
algorithms, as mentioned below:
Efficiency:
FP Growth algorithm is faster and more memory-efficient than other frequent itemset mining
algorithms such as Apriori, especially on large datasets with high dimensionality. This is because it
generates frequent itemsets by constructing the FP-Tree, which compresses the database and
requires only two scans.
Scalability:
FP Growth algorithm scales well with increasing database size and itemset dimensionality, making
it suitable for mining frequent itemsets in large datasets.
Resistant to noise:
FP Growth algorithm is more resistant to noise in the data than other frequent itemset mining
algorithms, as it generates only frequent itemsets and ignores infrequent itemsets that may be
caused by noise.
Parallelization:
FP Growth algorithm can be easily parallelized, making it suitable for distributed computing
environments and allowing it to take advantage of multi-core processors.
Disadvantages of FP Growth Algorithm
While the FP Growth algorithm in data mining has several advantages, it also has
some limitations and disadvantages, as mentioned below:
Memory consumption:
Although the FP Growth algorithm is more memory-efficient than other frequent
itemset mining algorithms, storing the FP-Tree and the conditional pattern bases can
still require a significant amount of memory, especially for large datasets.
Complex implementation:
The FP Growth algorithm is more complex than other frequent itemset mining
algorithms, making it more difficult to understand and implement.
 Association rule learning is a type of unsupervised learning method that tests for the dependence of one data element on another data element and creates rules accordingly so that they can be applied more effectively.
 It tries to discover interesting relations among the variables of the dataset. It relies on a number of rules to find interesting relations between variables in the database.
There are various applications of association rules, including the following:
• Items purchased on a credit card, such as rental cars and hotel rooms, give insight into the next products that customers are likely to buy.
• Optional services purchased by telecommunication users (call waiting, call forwarding, DSL, speed dialing, etc.) help decide how to bundle these services to maximize revenue.
• Banking services used by retail customers (money market accounts, CDs, investment services, car loans, etc.) identify customers likely to need other services.
• Unusual groups of insurance claims can be an indication of fraud and can spark further investigation.
• Medical patient histories can support predictions of likely complications based on a definite set of treatments.