2. Data Mining
• Data mining, or knowledge discovery, is the
computer-assisted process of digging through and
analyzing enormous data sets to extract
meaningful patterns from them.
Example :
• Market Basket Analysis - Understand what products or
services are commonly purchased together
3. Association Analysis
• It is one of the most important models invented and extensively
studied by the database and data mining communities.
• Proposed by Rakesh Agrawal and Ramakrishnan Srikant.
• Association rules are used to discover patterns that describe
strongly associated features in the data.
• Application - Business field where discovering of purchase
patterns or association between products is very useful for
decision-making and effective marketing.
4. Association Rules
• Association rules are of the form X -> Y, where X and Y are
disjoint itemsets.
• The strength of an association rule can be determined
in terms of Support and Confidence.
5. Notations
An itemset is a collection of zero or more
items.
An itemset that contains 'k' items is called a
k-itemset.
6. Procedure
Two subtasks:
Step 1 - Frequent Itemset Generation : It finds all the
itemsets whose support satisfies a user-defined minsup threshold.
Step 2 - Rule Generation : It extracts all the high-
confidence rules from the frequent itemsets found in
Step 1. These rules are called strong rules.
7. Support and Confidence
• Support : It determines how frequently the rule
applies in the transaction set T.
Let n be the number of transactions in T, and let σ(Z) be the
support count of an itemset Z (the number of transactions in T that contain Z).
Support(X -> Y) = σ(X U Y)/n
Ex : Consider the rule {Milk, Diapers} -> {Beer}
Support = 2/5 = 0.4
• Confidence : The confidence of a rule is the fraction of
transactions in T containing X that also contain Y.
Confidence(X -> Y) = σ(X U Y)/σ(X)
Confidence = 2/3 ≈ 0.667
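The numbers above can be reproduced with a short sketch. The five-transaction dataset below is hypothetical, chosen so that the rule {Milk, Diapers} -> {Beer} comes out with support 2/5 and confidence 2/3 as in the slide:

```python
# Hypothetical 5-transaction dataset consistent with the example values.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
n = len(transactions)

support = support_count(X | Y, transactions) / n                              # 2/5 = 0.4
confidence = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3

print(support, round(confidence, 3))
```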
8. The Apriori Principle
• If an itemset is frequent, then all of its subsets must also
be frequent. Conversely, if an itemset is infrequent, all of its
supersets must be infrequent - this is what enables support-based pruning.
• Apriori is the first association rule mining algorithm that
pioneered the use of support-based pruning to
systematically control the exponential growth of candidate
itemsets.
9. Apriori Algorithm
Pseudo-code:
Ck : candidate itemsets of size k
Lk : frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1
    that are contained in t;
  Lk+1 = candidates in Ck+1 with support >= min_support;
end
return ∪k Lk;
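The pseudo-code can be sketched as a runnable function. This is a minimal illustration, assuming transactions are given as Python sets and minsup is a minimum support count; the candidate-generation details are one possible realization:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining sketch of the pseudo-code above.

    transactions: list of sets of items; minsup: minimum support COUNT.
    Returns a dict mapping each frequent itemset (frozenset) to its count.
    """
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= minsup}
    frequent = dict(Lk)

    k = 1
    while Lk:
        # Ck+1: merge pairs of frequent k-itemsets that differ in a single
        # item, then prune candidates with an infrequent k-subset
        # (the Apriori principle).
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in Lk for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Count candidate occurrences in one pass over the transactions.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= minsup}
        frequent.update(Lk)
        k += 1
    return frequent

# Hypothetical dataset matching the earlier support/confidence example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
freq = apriori(transactions, minsup=3)
```

With minsup = 3, this dataset yields four frequent 1-itemsets and four frequent 2-itemsets; the only candidate 3-itemset, {Bread, Milk, Diapers}, falls below the threshold.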
11. Rule Generation
The Apriori algorithm uses a level-wise approach for
generating association rules, where each level
corresponds to the number of items that belong to the
rule consequent.
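For illustration, the sketch below enumerates every non-empty proper subset of each frequent itemset as a consequent, rather than growing consequents level by level as the slide describes; the function name and the toy support counts are assumptions:

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """Enumerate high-confidence rules X -> Y from frequent itemsets.

    frequent: dict mapping frozenset -> support count.
    Confidence(X -> Y) = count(X U Y) / count(X), as defined earlier.
    """
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent AND consequent
        for r in range(1, len(itemset)):
            for consequent in combinations(sorted(itemset), r):
                Y = frozenset(consequent)
                X = itemset - Y
                conf = count / frequent[X]
                if conf >= minconf:
                    rules.append((X, Y, conf))
    return rules

# Hypothetical support counts for a tiny frequent-itemset collection.
frequent = {
    frozenset({"Diapers"}): 4,
    frozenset({"Beer"}): 3,
    frozenset({"Diapers", "Beer"}): 3,
}
rules = generate_rules(frequent, minconf=0.7)
```

Here {Diapers} -> {Beer} has confidence 3/4 = 0.75 and {Beer} -> {Diapers} has confidence 3/3 = 1.0, so both rules pass the 0.7 threshold.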
13. Approaches for frequent itemset generation
Brute-force method :
Advantages :
• This method considers every possible k-itemset as a
candidate, so it is simple and cannot miss any frequent itemset;
the candidate pruning step then removes the unnecessary candidates.
Disadvantages :
• Candidate pruning becomes extremely expensive because
a very large number of itemsets must be examined.
14. Fk-1 X Fk-1 Itemset Generation
Advantages :
• This method merges a pair of frequent (k-1)-itemsets
only if their first (k-2) items are identical, so each candidate
is generated only once.
Disadvantages :
• This method requires an extra pruning step to ensure
that the remaining k-2 subsets of size (k-1) are also frequent.
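The merge step above can be sketched as follows, assuming each frequent (k-1)-itemset is stored as a sorted tuple (the function name and example data are illustrative):

```python
def merge_fk1_fk1(frequent_k1):
    """Fk-1 x Fk-1 candidate generation.

    frequent_k1: list of frequent (k-1)-itemsets as sorted tuples.
    Two itemsets are merged only when all items except the last are
    identical, so each candidate k-itemset is produced exactly once.
    """
    candidates = []
    itemsets = sorted(frequent_k1)
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = itemsets[i], itemsets[j]
            if a[:-1] == b[:-1]:  # first k-2 items identical
                candidates.append(a + (b[-1],))
    return candidates

# Hypothetical frequent 2-itemsets; only the pair sharing the
# prefix ("Bread",) merges, giving one candidate 3-itemset.
f2 = [("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
c3 = merge_fk1_fk1(f2)
```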
15. Fk-1 X F1 Itemset Generation
Advantages :
• This method extends each frequent (k-1)-itemset
with other frequent items. For instance, it combines frequent
2-itemsets with frequent 1-itemsets to form candidate
3-itemsets.
Disadvantages :
• This method does not prevent the same candidate itemset
from being generated more than once.
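The duplicate-generation problem shows up directly in a small sketch (names and data are illustrative): two different frequent 2-itemsets extended by different frequent items can yield the same candidate 3-itemset.

```python
def merge_fk1_f1(frequent_k1, frequent_1):
    """Fk-1 x F1 candidate generation: extend each frequent (k-1)-itemset
    with each frequent item not already in it. Without extra bookkeeping
    the same candidate can be generated more than once."""
    candidates = []
    for itemset in frequent_k1:
        for (item,) in frequent_1:
            if item not in itemset:
                candidates.append(tuple(sorted(itemset + (item,))))
    return candidates

# ("Bread", "Diapers") + Milk and ("Bread", "Milk") + Diapers both
# produce the candidate ("Bread", "Diapers", "Milk") - a duplicate.
f2 = [("Bread", "Diapers"), ("Bread", "Milk")]
f1 = [("Bread",), ("Diapers",), ("Milk",)]
c3 = merge_fk1_f1(f2, f1)
```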