Association Rule Mining
 Proposed by Agrawal et al. in 1993.
 Definition: A technique in data mining for discovering interesting relationships between
variables in large datasets.
 Purpose: Identifies patterns, correlations, or causal structures among sets of items.
 Application:
 Market Basket Analysis
 Cross-selling strategies
 Customer segmentation
Market Basket Analysis
 Objective: To understand the purchase behavior of customers by identifying
the sets of products bought together.
 Association rules:
A ⇒ D (customers who purchase product A also tend to buy product D at the same time)
 Key Concepts:
 Itemset – a set of items (e.g., x = {milk, bread, cereal} is an itemset)
 k-itemset – an itemset with k items (e.g., x = {milk, bread, cereal} is a 3-itemset)
 Support (s) – the fraction of transactions that contain both X and Y:
s(X ⇒ Y) = count(X ∪ Y) / N
 Confidence (c) – how often the items in Y appear in transactions that contain X:
c(X ⇒ Y) = count(X ∪ Y) / count(X)
 Process:
 Step 1: Collect and preprocess data (sales transactions).
 Step 2: Identify frequent itemsets using algorithms (e.g., Apriori, FP-Growth, Eclat
Algorithm).
 Step 3: Generate association rules from these itemsets.
 Step 4: Evaluate the rules based on metrics (support, confidence, lift).
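The four-step process above can be exercised end to end in a few lines. A minimal sketch in Python, assuming the third-party mlxtend library and a toy transaction list (neither appears in the slides):

```python
# Sketch of the four-step process, assuming the third-party mlxtend
# library and a toy transaction list (neither is from the slides).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Step 1: collect and preprocess data (toy transactions, one-hot encoded).
transactions = [
    ["milk", "bread", "cereal"],
    ["milk", "bread"],
    ["bread", "cereal"],
    ["milk", "cereal"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Step 2: identify frequent itemsets (Apriori here; FP-Growth/Eclat also work).
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Steps 3 and 4: generate rules and evaluate them by support, confidence, lift.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```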
Market Basket Analysis - Example
 Example rule: {Shoes, Trouser} ⇒ Shirt (over N = 4 transactions)
 s = count(Shoes, Trouser, Shirt) / N
   = 2/4
   = 0.5
 c = count(Shoes, Trouser, Shirt) / count(Shoes, Trouser)
   = 2/3
   ≈ 0.67
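The slide gives only the counts, not the underlying transactions. A small Python sketch with a hypothetical 4-transaction dataset chosen to reproduce those counts (2 of 4 transactions contain all three items; 3 contain {Shoes, Trouser}):

```python
# Hypothetical dataset matching the counts stated in the example above:
# the slides do not show the actual transactions.
transactions = [
    {"Shoes", "Trouser", "Shirt"},
    {"Shoes", "Trouser", "Shirt"},
    {"Shoes", "Trouser"},
    {"Shirt", "Socks"},
]

antecedent = {"Shoes", "Trouser"}
both = antecedent | {"Shirt"}

n = len(transactions)
count_both = sum(both <= t for t in transactions)              # 2
count_antecedent = sum(antecedent <= t for t in transactions)  # 3

support = count_both / n                   # 2/4 = 0.5
confidence = count_both / count_antecedent # 2/3 ≈ 0.67
print(support, round(confidence, 2))
```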
Apriori Algorithm
 The Apriori algorithm is a technique for finding frequent itemsets in transactional data. It relies on the Apriori property: every subset of a frequent itemset must itself be frequent, so candidates containing an infrequent subset can be pruned.
 Steps:
 Define Minimum Support and Minimum Confidence Thresholds
 Find Frequent Individual Items
 Generate Candidate Itemsets of Size 2
 Prune Infrequent Candidate Itemsets
 Generate Candidate Itemsets of Size 3
 Continue Generating and Pruning Candidate Itemsets Until No More Frequent Itemsets Are
Found
 Generate Association Rules from Frequent Itemsets
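A minimal, self-contained sketch of these steps in Python (absolute support counts; the generate-and-prune loop is a simplified join, not the textbook prefix join):

```python
def apriori(transactions, min_support):
    """Minimal Apriori sketch. transactions: list of sets;
    min_support: absolute count threshold. Returns {itemset: count}."""
    # Find frequent individual items (1-itemsets).
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    current = []
    for item in items:
        count = sum(item in t for t in transactions)
        if count >= min_support:
            fs = frozenset([item])
            frequent[fs] = count
            current.append(fs)
    k = 2
    while current:
        # Generate candidate k-itemsets by joining frequent (k-1)-itemsets,
        # then prune every candidate that falls below minimum support.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = []
        for cand in candidates:
            count = sum(cand <= t for t in transactions)
            if count >= min_support:
                frequent[cand] = count
                current.append(cand)
        k += 1
    return frequent

# Example run with absolute support >= 2:
demo = [{"milk", "bread"}, {"milk", "cereal"}, {"milk", "bread", "cereal"}]
for itemset, count in apriori(demo, 2).items():
    print(set(itemset), count)
```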
Apriori Algorithm - Example
Suppose min_confidence = 40%.
If confidence >= min_confidence, we accept the rule.
Frequent itemset: {C1R, A2R, A2C}
Non-empty proper subsets: {C1R,A2R}, {C1R,A2C}, {A2R,A2C}, {C1R}, {A2R}, {A2C}
 Rule 1: {C1R, A2R} ⇒ {A2C}
Confidence = support(C1R, A2R, A2C) / support(C1R, A2R)
           = (2/4) / (2/4)
           = 100% > 40%
So Rule 1 is accepted.
 Rule 2: {C1R, A2C} ⇒ {A2R}
Confidence = support(C1R, A2R, A2C) / support(C1R, A2C)
           = (2/4) / (3/4)
           = 2/3
           ≈ 66.7% > 40%
So Rule 2 is accepted.
In the same way we check the rules formed from all remaining subsets.
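A short Python sketch of this confidence check. The transaction table behind C1R/A2R/A2C is not reproduced on these slides, so the support values below are hard-coded from the example:

```python
# Support values (out of 4 transactions) hard-coded from the example;
# the underlying transaction table is not shown on the slides.
support = {
    frozenset({"C1R", "A2R", "A2C"}): 2 / 4,
    frozenset({"C1R", "A2R"}): 2 / 4,
    frozenset({"C1R", "A2C"}): 3 / 4,
}

min_confidence = 0.40
itemset = frozenset({"C1R", "A2R", "A2C"})

for antecedent in (frozenset({"C1R", "A2R"}), frozenset({"C1R", "A2C"})):
    consequent = itemset - antecedent
    confidence = support[itemset] / support[antecedent]
    verdict = "accepted" if confidence >= min_confidence else "rejected"
    print(f"{set(antecedent)} => {set(consequent)}: {confidence:.1%} ({verdict})")
```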
Apriori Advantages/Disadvantages
 Advantages
 Uses the large (frequent) itemset property
 Easily parallelized
 Easy to implement
 Disadvantages
 Assumes the transaction database is memory-resident
 Requires many database scans
Frequent Pattern Growth Approach
 An efficient data mining algorithm for finding frequent itemsets without
candidate generation.
 Ideal for large datasets where the Apriori algorithm's performance degrades due
to the generation of a large number of candidate sets.
 Steps
 Calculate the minimum support count.
 Scan the data to find each item's frequency of occurrence.
 Assign priorities to the items by descending frequency.
 Reorder the items in each transaction according to priority.
 Validate the result (build the FP-tree from the ordered transactions and mine it for frequent itemsets).
Frequent Pattern Growth - Example
TID | Items
1   | E,A,D,B
2   | D,A,C,E,B
3   | C,A,B,E
4   | B,A,D
5   | D
6   | D,B
7   | A,D,E
8   | B,C
Assume minimum support = 30%
Minimum support count = 30/100 * 8 = 2.4, so an item needs a count of at least 3.
Item frequencies and priorities (priority 1 = most frequent; every item meets the threshold):
Item | Frequency | Priority
A    | 5         | 3
B    | 6         | 1
C    | 3         | 5
D    | 6         | 2
E    | 4         | 4
Each transaction rewritten with its items in priority order:
TID | Items     | Ordered items
1   | E,A,D,B   | B,D,A,E
2   | D,A,C,E,B | B,D,A,E,C
3   | C,A,B,E   | B,A,E,C
4   | B,A,D     | B,D,A
5   | D         | D
6   | D,B       | B,D
7   | A,D,E     | D,A,E
8   | B,C       | B,C
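A minimal Python sketch of the first two FP-Growth passes over this example: count item frequencies, assign priorities, and rewrite each transaction in priority order (ties such as B and D broken alphabetically, matching the tables above):

```python
from collections import Counter

# The T1..T8 transactions from the example above.
transactions = {
    1: ["E", "A", "D", "B"],
    2: ["D", "A", "C", "E", "B"],
    3: ["C", "A", "B", "E"],
    4: ["B", "A", "D"],
    5: ["D"],
    6: ["D", "B"],
    7: ["A", "D", "E"],
    8: ["B", "C"],
}

min_count = 0.30 * len(transactions)  # 2.4, so a count of 3+ qualifies

# Pass 1: item frequencies, then priorities by descending frequency
# (alphabetical tie-break, so B=1, D=2, A=3, E=4, C=5).
counts = Counter(item for t in transactions.values() for item in t)
priority = {item: rank
            for rank, (item, _) in enumerate(
                sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])), 1)}

# Pass 2: drop infrequent items and reorder each transaction by priority.
ordered = {tid: sorted((i for i in items if counts[i] >= min_count),
                       key=priority.get)
           for tid, items in transactions.items()}
for tid, items in ordered.items():
    print(tid, ",".join(items))
```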
Advantages of FP-Growth
 Increased Efficiency
 Requires only two passes over the dataset, minimizing the time and resources
needed for data processing.
 Significantly faster as it avoids the exhaustive candidate generation phase that
Apriori undergoes.
 Elimination of Candidate Generation
 Directly mines the frequent itemsets from the FP-tree without generating
candidate itemsets.
 Lower Memory Requirements
 The FP-tree data structure is compact and often requires less memory than
storing extensive candidate sets, especially beneficial for dense datasets.
Eclat Algorithm
 Eclat stands for Equivalence Class Clustering and bottom-up
Lattice Traversal.
 It is a depth-first search algorithm used for mining frequent
itemsets, introduced as an efficient alternative to the Apriori
algorithm.
 Uses a vertical database format, where each item is stored with the set of
transaction IDs (its TID-set) in which it appears.
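A minimal depth-first Eclat sketch in Python, using TID-set intersections over the T1–T9 dataset from the example that follows:

```python
from collections import defaultdict

def eclat(prefix, item_tids, min_support, frequent):
    """Depth-first Eclat over a vertical database. item_tids is a list of
    (item, TID-set) pairs that are already frequent."""
    for i, (item, tids) in enumerate(item_tids):
        itemset = prefix + (item,)
        frequent[itemset] = tids
        # Extend the prefix: intersect this TID-set with each remaining
        # item's TID-set, keeping only intersections that stay frequent.
        suffix = []
        for other, other_tids in item_tids[i + 1:]:
            common = tids & other_tids
            if len(common) >= min_support:
                suffix.append((other, common))
        if suffix:
            eclat(itemset, suffix, min_support, frequent)

# Vertical layout of the T1..T9 dataset from the example below.
transactions = {
    "T1": {"Bread", "Butter", "Jam"},  "T2": {"Butter", "Coke"},
    "T3": {"Butter", "Milk"},          "T4": {"Bread", "Butter", "Coke"},
    "T5": {"Bread", "Milk"},           "T6": {"Butter", "Milk"},
    "T7": {"Bread", "Milk"},           "T8": {"Bread", "Butter", "Milk", "Jam"},
    "T9": {"Bread", "Butter", "Milk"},
}
tidsets = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        tidsets[item].add(tid)

min_support = 2
start = [(item, tids) for item, tids in sorted(tidsets.items())
         if len(tids) >= min_support]
frequent = {}
eclat((), start, min_support, frequent)
for itemset, tids in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, sorted(tids))
```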
Example for Eclat Algorithm
TID | Itemset
T1  | Bread, Butter, Jam
T2  | Butter, Coke
T3  | Butter, Milk
T4  | Bread, Butter, Coke
T5  | Bread, Milk
T6  | Butter, Milk
T7  | Bread, Milk
T8  | Bread, Butter, Milk, Jam
T9  | Bread, Butter, Milk
The same data in vertical (boolean) form:
TID | Bread | Butter | Milk | Coke | Jam
T1  | 1     | 1      | 0    | 0    | 1
T2  | 0     | 1      | 0    | 1    | 0
T3  | 0     | 1      | 1    | 0    | 0
T4  | 1     | 1      | 0    | 1    | 0
T5  | 1     | 0      | 1    | 0    | 0
T6  | 0     | 1      | 1    | 0    | 0
T7  | 1     | 0      | 1    | 0    | 0
T8  | 1     | 1      | 1    | 0    | 1
T9  | 1     | 1      | 1    | 0    | 0
K = 1, minimum support = 2
Item   | TID set
Bread  | {T1,T4,T5,T7,T8,T9}
Butter | {T1,T2,T3,T4,T6,T8,T9}
Milk   | {T3,T5,T6,T7,T8,T9}
Coke   | {T2,T4}
Jam    | {T1,T8}
K = 2
Itemset        | TID set
{Bread,Butter} | {T1,T4,T8,T9}
{Bread,Milk}   | {T5,T7,T8,T9}
{Bread,Coke}   | {T4}
{Bread,Jam}    | {T1,T8}
{Butter,Milk}  | {T3,T6,T8,T9}
{Butter,Coke}  | {T2,T4}
{Butter,Jam}   | {T1,T8}
{Milk,Jam}     | {T8}
K = 3
Itemset             | TID set
{Bread,Butter,Milk} | {T8,T9}
{Bread,Butter,Jam}  | {T1,T8}
{Bread,Butter,Coke} | {T4}
{Bread,Milk,Jam}    | {T8}
K = 4
Itemset                 | TID set
{Bread,Butter,Milk,Jam} | {T8}
• We stop at K = 4 because there are no more item/TID-set pairs to combine.
• With minimum support = 2, we conclude the following rules from the given dataset:
Item bought      | Recommended products
Bread            | Butter
Bread            | Milk
Bread            | Jam
Butter           | Milk
Butter           | Jam
Bread and Butter | Milk
Bread and Butter | Jam
Eclat Algorithm Advantages/Disadvantages
 Advantages
 Depth-first search reduces memory requirements
 Usually (considerably) faster than Apriori
 No need to scan the database to find the support of (k+1)
itemsets, for k>=1
 Disadvantages
 The TID-sets can be quite long, hence expensive to manipulate
Thank You!
