Association Rule Mining
 Proposed by Agrawal et al. in 1993.
 Definition: A technique in data mining for discovering interesting relationships between
variables in large datasets.
 Purpose: Identifies patterns, correlations, or causal structures among sets of items.
 Application:
 Market Basket Analysis
 Cross-selling strategies
 Customer segmentation
Market Basket Analysis
 Objective: To understand the purchase behavior of customers by identifying
the sets of products bought together.
 Association rules:
A ⇒ D (customers who purchase product A also tend to buy product D at the same time)
 Key Concepts:
 Itemset – a set of items (e.g., x = {milk, bread, cereal} is an itemset)
 k-itemset – an itemset with k items (e.g., x = {milk, bread, cereal} is a 3-itemset)
 Support (s) – the fraction of transactions that contain both X and Y:
s(X ⇒ Y) = count(X ∪ Y) / N
 Confidence (c) – how often the items in Y appear in transactions that contain X:
c(X ⇒ Y) = count(X ∪ Y) / count(X)
 Process:
 Step 1: Collect and preprocess data (sales transactions).
 Step 2: Identify frequent itemsets using algorithms (e.g., Apriori, FP-Growth, Eclat
Algorithm).
 Step 3: Generate association rules from these itemsets.
 Step 4: Evaluate the rules based on metrics (support, confidence, lift).
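The four-step process above can be exercised end to end in a few lines. A minimal sketch in Python, assuming the third-party mlxtend library and a toy transaction list (neither appears in the slides):

```python
# Sketch of the four-step process, assuming the third-party mlxtend
# library and a toy transaction list (neither is from the slides).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Step 1: collect and preprocess data (toy transactions, one-hot encoded).
transactions = [
    ["milk", "bread", "cereal"],
    ["milk", "bread"],
    ["bread", "cereal"],
    ["milk", "cereal"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Step 2: identify frequent itemsets (Apriori here; FP-Growth/Eclat also work).
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Steps 3 and 4: generate rules and evaluate them by support, confidence, lift.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```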
Market Basket Analysis - Example
 Example rule: {Shoes, Trouser} ⇒ Shirt (over N = 4 transactions)
 s = count(Shoes, Trouser, Shirt) / N
   = 2/4
   = 0.5
 c = count(Shoes, Trouser, Shirt) / count(Shoes, Trouser)
   = 2/3
   ≈ 0.67
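The slide gives only the counts, not the underlying transactions. A small Python sketch with a hypothetical 4-transaction dataset chosen to reproduce those counts (2 of 4 transactions contain all three items; 3 contain {Shoes, Trouser}):

```python
# Hypothetical dataset matching the counts stated in the example above:
# the slides do not show the actual transactions.
transactions = [
    {"Shoes", "Trouser", "Shirt"},
    {"Shoes", "Trouser", "Shirt"},
    {"Shoes", "Trouser"},
    {"Shirt", "Socks"},
]

antecedent = {"Shoes", "Trouser"}
both = antecedent | {"Shirt"}

n = len(transactions)
count_both = sum(both <= t for t in transactions)              # 2
count_antecedent = sum(antecedent <= t for t in transactions)  # 3

support = count_both / n                   # 2/4 = 0.5
confidence = count_both / count_antecedent # 2/3 ≈ 0.67
print(support, round(confidence, 2))
```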
Apriori Algorithm
 The Apriori algorithm is a technique for finding frequent itemsets in transactional data. It relies on the Apriori property: every subset of a frequent itemset must itself be frequent, so candidates containing an infrequent subset can be pruned.
 Steps:
 Define Minimum Support and Minimum Confidence Thresholds
 Find Frequent Individual Items
 Generate Candidate Itemsets of Size 2
 Prune Infrequent Candidate Itemsets
 Generate Candidate Itemsets of Size 3
 Continue Generating and Pruning Candidate Itemsets Until No More Frequent Itemsets Are
Found
 Generate Association Rules from Frequent Itemsets
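A minimal, self-contained sketch of these steps in Python (absolute support counts; the generate-and-prune loop is a simplified join, not the textbook prefix join):

```python
def apriori(transactions, min_support):
    """Minimal Apriori sketch. transactions: list of sets;
    min_support: absolute count threshold. Returns {itemset: count}."""
    # Find frequent individual items (1-itemsets).
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    current = []
    for item in items:
        count = sum(item in t for t in transactions)
        if count >= min_support:
            fs = frozenset([item])
            frequent[fs] = count
            current.append(fs)
    k = 2
    while current:
        # Generate candidate k-itemsets by joining frequent (k-1)-itemsets,
        # then prune every candidate that falls below minimum support.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = []
        for cand in candidates:
            count = sum(cand <= t for t in transactions)
            if count >= min_support:
                frequent[cand] = count
                current.append(cand)
        k += 1
    return frequent

# Example run with absolute support >= 2:
demo = [{"milk", "bread"}, {"milk", "cereal"}, {"milk", "bread", "cereal"}]
for itemset, count in apriori(demo, 2).items():
    print(set(itemset), count)
```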
Apriori Algorithm - Example
Suppose min_confidence = 40%.
If confidence >= min_confidence, we accept the rule.
Frequent itemset: {C1R, A2R, A2C}
Non-empty proper subsets: {C1R,A2R}, {C1R,A2C}, {A2R,A2C}, {C1R}, {A2R}, {A2C}
 Rule 1: {C1R, A2R} ⇒ {A2C}
Confidence = support(C1R, A2R, A2C) / support(C1R, A2R)
           = (2/4) / (2/4)
           = 100% > 40%
So Rule 1 is accepted.
 Rule 2: {C1R, A2C} ⇒ {A2R}
Confidence = support(C1R, A2R, A2C) / support(C1R, A2C)
           = (2/4) / (3/4)
           = 2/3
           ≈ 66.7% > 40%
So Rule 2 is accepted.
In the same way we check the rules formed from all remaining subsets.
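A short Python sketch of this confidence check. The transaction table behind C1R/A2R/A2C is not reproduced on these slides, so the support values below are hard-coded from the example:

```python
# Support values (out of 4 transactions) hard-coded from the example;
# the underlying transaction table is not shown on the slides.
support = {
    frozenset({"C1R", "A2R", "A2C"}): 2 / 4,
    frozenset({"C1R", "A2R"}): 2 / 4,
    frozenset({"C1R", "A2C"}): 3 / 4,
}

min_confidence = 0.40
itemset = frozenset({"C1R", "A2R", "A2C"})

for antecedent in (frozenset({"C1R", "A2R"}), frozenset({"C1R", "A2C"})):
    consequent = itemset - antecedent
    confidence = support[itemset] / support[antecedent]
    verdict = "accepted" if confidence >= min_confidence else "rejected"
    print(f"{set(antecedent)} => {set(consequent)}: {confidence:.1%} ({verdict})")
```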
Apriori Advantages/Disadvantages
 Advantages
 Uses the large (frequent) itemset property
 Easily parallelized
 Easy to implement
 Disadvantages
 Assumes the transaction database is memory-resident
 Requires many database scans
Frequent Pattern Growth Approach
 An efficient data mining algorithm for finding frequent itemsets without
candidate generation.
 Ideal for large datasets where the Apriori algorithm's performance degrades due
to the generation of a large number of candidate sets.
 Steps
 Calculate the minimum support count.
 Scan the data to find each item's frequency of occurrence.
 Assign priorities to the items by descending frequency.
 Reorder the items in each transaction according to priority.
 Validate the result (build the FP-tree from the ordered transactions and mine it for frequent itemsets).
Frequent Pattern Growth - Example
TID | Items
1   | E,A,D,B
2   | D,A,C,E,B
3   | C,A,B,E
4   | B,A,D
5   | D
6   | D,B
7   | A,D,E
8   | B,C
Assume minimum support = 30%
Minimum support count = 30/100 * 8 = 2.4, so an item needs a count of at least 3.
Item frequencies and priorities (priority 1 = most frequent; every item meets the threshold):
Item | Frequency | Priority
A    | 5         | 3
B    | 6         | 1
C    | 3         | 5
D    | 6         | 2
E    | 4         | 4
Each transaction rewritten with its items in priority order:
TID | Items     | Ordered items
1   | E,A,D,B   | B,D,A,E
2   | D,A,C,E,B | B,D,A,E,C
3   | C,A,B,E   | B,A,E,C
4   | B,A,D     | B,D,A
5   | D         | D
6   | D,B       | B,D
7   | A,D,E     | D,A,E
8   | B,C       | B,C
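A minimal Python sketch of the first two FP-Growth passes over this example: count item frequencies, assign priorities, and rewrite each transaction in priority order (ties such as B and D broken alphabetically, matching the tables above):

```python
from collections import Counter

# The T1..T8 transactions from the example above.
transactions = {
    1: ["E", "A", "D", "B"],
    2: ["D", "A", "C", "E", "B"],
    3: ["C", "A", "B", "E"],
    4: ["B", "A", "D"],
    5: ["D"],
    6: ["D", "B"],
    7: ["A", "D", "E"],
    8: ["B", "C"],
}

min_count = 0.30 * len(transactions)  # 2.4, so a count of 3+ qualifies

# Pass 1: item frequencies, then priorities by descending frequency
# (alphabetical tie-break, so B=1, D=2, A=3, E=4, C=5).
counts = Counter(item for t in transactions.values() for item in t)
priority = {item: rank
            for rank, (item, _) in enumerate(
                sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])), 1)}

# Pass 2: drop infrequent items and reorder each transaction by priority.
ordered = {tid: sorted((i for i in items if counts[i] >= min_count),
                       key=priority.get)
           for tid, items in transactions.items()}
for tid, items in ordered.items():
    print(tid, ",".join(items))
```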
Advantages of FP-Growth
 Increased Efficiency
 Requires only two passes over the dataset, minimizing the time and resources
needed for data processing.
 Significantly faster as it avoids the exhaustive candidate generation phase that
Apriori undergoes.
 Elimination of Candidate Generation
 Directly mines the frequent itemsets from the FP-tree without generating
candidate itemsets.
 Lower Memory Requirements
 The FP-tree data structure is compact and often requires less memory than
storing extensive candidate sets, especially beneficial for dense datasets.
Eclat Algorithm
 Eclat stands for Equivalence Class Clustering and bottom-up
Lattice Traversal.
 It is a depth-first search algorithm used for mining frequent
itemsets, introduced as an efficient alternative to the Apriori
algorithm.
 Uses a vertical database format, where each item is stored with the set of
transaction IDs (its TID-set) in which it appears.
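A minimal depth-first Eclat sketch in Python, using TID-set intersections over the T1–T9 dataset from the example that follows:

```python
from collections import defaultdict

def eclat(prefix, item_tids, min_support, frequent):
    """Depth-first Eclat over a vertical database. item_tids is a list of
    (item, TID-set) pairs that are already frequent."""
    for i, (item, tids) in enumerate(item_tids):
        itemset = prefix + (item,)
        frequent[itemset] = tids
        # Extend the prefix: intersect this TID-set with each remaining
        # item's TID-set, keeping only intersections that stay frequent.
        suffix = []
        for other, other_tids in item_tids[i + 1:]:
            common = tids & other_tids
            if len(common) >= min_support:
                suffix.append((other, common))
        if suffix:
            eclat(itemset, suffix, min_support, frequent)

# Vertical layout of the T1..T9 dataset from the example below.
transactions = {
    "T1": {"Bread", "Butter", "Jam"},  "T2": {"Butter", "Coke"},
    "T3": {"Butter", "Milk"},          "T4": {"Bread", "Butter", "Coke"},
    "T5": {"Bread", "Milk"},           "T6": {"Butter", "Milk"},
    "T7": {"Bread", "Milk"},           "T8": {"Bread", "Butter", "Milk", "Jam"},
    "T9": {"Bread", "Butter", "Milk"},
}
tidsets = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        tidsets[item].add(tid)

min_support = 2
start = [(item, tids) for item, tids in sorted(tidsets.items())
         if len(tids) >= min_support]
frequent = {}
eclat((), start, min_support, frequent)
for itemset, tids in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, sorted(tids))
```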
Example for Eclat Algorithm
TID | Itemset
T1  | Bread, Butter, Jam
T2  | Butter, Coke
T3  | Butter, Milk
T4  | Bread, Butter, Coke
T5  | Bread, Milk
T6  | Butter, Milk
T7  | Bread, Milk
T8  | Bread, Butter, Milk, Jam
T9  | Bread, Butter, Milk
The same data in vertical (boolean) form:
TID | Bread | Butter | Milk | Coke | Jam
T1  | 1     | 1      | 0    | 0    | 1
T2  | 0     | 1      | 0    | 1    | 0
T3  | 0     | 1      | 1    | 0    | 0
T4  | 1     | 1      | 0    | 1    | 0
T5  | 1     | 0      | 1    | 0    | 0
T6  | 0     | 1      | 1    | 0    | 0
T7  | 1     | 0      | 1    | 0    | 0
T8  | 1     | 1      | 1    | 0    | 1
T9  | 1     | 1      | 1    | 0    | 0
K = 1, minimum support = 2
Item   | TID set
Bread  | {T1,T4,T5,T7,T8,T9}
Butter | {T1,T2,T3,T4,T6,T8,T9}
Milk   | {T3,T5,T6,T7,T8,T9}
Coke   | {T2,T4}
Jam    | {T1,T8}
K = 2
Itemset        | TID set
{Bread,Butter} | {T1,T4,T8,T9}
{Bread,Milk}   | {T5,T7,T8,T9}
{Bread,Coke}   | {T4}
{Bread,Jam}    | {T1,T8}
{Butter,Milk}  | {T3,T6,T8,T9}
{Butter,Coke}  | {T2,T4}
{Butter,Jam}   | {T1,T8}
{Milk,Jam}     | {T8}
K = 3
Itemset             | TID set
{Bread,Butter,Milk} | {T8,T9}
{Bread,Butter,Jam}  | {T1,T8}
{Bread,Butter,Coke} | {T4}
{Bread,Milk,Jam}    | {T8}
K = 4
Itemset                 | TID set
{Bread,Butter,Milk,Jam} | {T8}
• We stop at K = 4 because there are no more item/TID-set pairs to combine.
• With minimum support = 2, we conclude the following rules from the given dataset:
Item bought      | Recommended products
Bread            | Butter
Bread            | Milk
Bread            | Jam
Butter           | Milk
Butter           | Jam
Bread and Butter | Milk
Bread and Butter | Jam
Eclat Algorithm Advantages/Disadvantages
 Advantages
 Depth-first search reduces memory requirements
 Usually (considerably) faster than Apriori
 No need to scan the database to find the support of (k+1)
itemsets, for k>=1
 Disadvantages
 The TID-sets can be quite long, hence expensive to manipulate
Thank You!
