Unit III : Association Rule Mining
Association Rule: Problem Definition
Association rule mining finds interesting associations and relationships among large sets of
data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example
is Market Basket Analysis, one of the key techniques used by large retailers to discover
associations between items. It allows retailers to identify relationships
between the items that people frequently buy together. Given a set of transactions, we can find
rules that will predict the occurrence of an item based on the occurrences of other items in the
transaction.
Association rule learning is an unsupervised learning technique that checks for the
dependency of one data item on another and maps those dependencies so they can be exploited,
for example to make promotions or product placement more profitable.
Key Concepts:
1. Itemset: A collection of one or more items (or attributes); a frequent itemset is one whose
items often appear together in transactions.
o Example: In a grocery store, items like bread, milk, and butter may often be
bought together.
2. Association Rule: An implication of the form A → B, where:
o A is a set of items (antecedent).
o B is another set of items (consequent).
o The rule suggests that if A occurs, B is likely to occur as well.
o Example: "If a customer buys bread and butter, they are likely to buy milk."
Steps in Association Rule Mining:
1. Frequent Itemset Generation: Find all itemsets that occur frequently in the dataset,
based on a predefined minimum support threshold.
2. Rule Generation: For each frequent itemset, generate possible association rules that
meet the minimum confidence threshold.
3. Evaluation: Evaluate the generated rules using metrics like support, confidence, and lift
to determine their usefulness.
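For reference, the three metrics mentioned above are commonly defined as follows (standard definitions, written in the plain notation used throughout this unit):
 Support(A → B) = (number of transactions containing both A and B) / (total number of transactions)
 Confidence(A → B) = Support(A ∪ B) / Support(A)
 Lift(A → B) = Confidence(A → B) / Support(B); a lift greater than 1 means A and B occur together more often than would be expected if they were independent.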
Example of Association Rule Mining:
Dataset (Transaction Data):
We have a small dataset with 5 transactions at a grocery store:
Transaction ID Items Bought
1 Bread, Butter
2 Bread, Milk
3 Bread, Butter, Milk
4 Butter, Milk
5 Bread, Butter, Milk, Jam
Step 1: Frequent Itemset Generation
We first need to find frequent itemsets that meet a minimum support threshold. Let's say our
minimum support is 60%, which means at least 3 transactions should contain the itemset.
1.1 Itemsets of size 1:
 Bread appears in transactions 1, 2, 3, 5 → 4 transactions
 Butter appears in transactions 1, 3, 4, 5 → 4 transactions
 Milk appears in transactions 2, 3, 4, 5 → 4 transactions
 Jam appears in transaction 5 → 1 transaction
Support for each of these single items:
 Support(Bread) = 4/5 = 80%
 Support(Butter) = 4/5 = 80%
 Support(Milk) = 4/5 = 80%
 Support(Jam) = 1/5 = 20%
Since the minimum support is 60%, Jam is not a frequent itemset, while Bread, Butter, and Milk
are.
1.2 Itemsets of size 2 (pairs):
 Bread, Butter appears in transactions 1, 3, 5 → 3 transactions
 Bread, Milk appears in transactions 2, 3, 5 → 3 transactions
 Butter, Milk appears in transactions 3, 4, 5 → 3 transactions
Support for each pair:
 Support(Bread, Butter) = 3/5 = 60%
 Support(Bread, Milk) = 3/5 = 60%
 Support(Butter, Milk) = 3/5 = 60%
Since all pairs meet the minimum support, they are frequent itemsets.
1.3 Itemsets of size 3 (triplets):
 Bread, Butter, Milk appears in transactions 3, 5 → 2 transactions (not frequent)
Step 2: Rule Generation
Now, we generate association rules from the frequent itemsets.
 From the frequent itemset {Bread, Butter}, we can create the rule:
o Bread → Butter (If a customer buys Bread, they are likely to buy Butter)
o Butter → Bread (If a customer buys Butter, they are likely to buy Bread)
 From the frequent itemset {Bread, Milk}, we can create the rule:
o Bread → Milk (If a customer buys Bread, they are likely to buy Milk)
o Milk → Bread (If a customer buys Milk, they are likely to buy Bread)
 From the frequent itemset {Butter, Milk}, we can create the rule:
o Butter → Milk (If a customer buys Butter, they are likely to buy Milk)
o Milk → Butter (If a customer buys Milk, they are likely to buy Butter)
Step 3: Evaluate Rules Using Confidence
Next, let's calculate the confidence for the rule Bread → Butter:
 Confidence(Bread → Butter) = Support(Bread, Butter) / Support(Bread) = 3/4 = 0.75 or 75%
Similarly, we can calculate confidence for other rules:
 Confidence(Bread → Milk) = 3/4 = 0.75 or 75%
 Confidence(Butter → Milk) = 3/4 = 0.75 or 75%
Step 4: Final Rules
Based on support and confidence, we can conclude the following strong association rules:
 Bread → Butter (75% confidence)
 Bread → Milk (75% confidence)
 Butter → Milk (75% confidence)
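These figures can be reproduced with a short script. The sketch below uses plain Python (no external libraries) and hard-codes the five example transactions; the function names support and confidence are just illustrative helpers.

```python
# The five example transactions from the table above.
transactions = [
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Bread", "Butter", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Milk", "Jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Bread", "Butter"}))       # 0.6
print(confidence({"Bread"}, {"Butter"}))  # 0.75
```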
Frequent Itemset Generation
It is a crucial step in association rule mining. The goal of frequent itemset generation is to find
itemsets (combinations of items) that appear frequently in the dataset based on a specified
support threshold.
Key Terminology:
 Itemset: a set of one or more items, e.g., {Bread, Butter}.
 Support: the fraction of transactions that contain the itemset.
 Minimum support threshold: the user-defined cutoff an itemset must meet to be considered frequent.
Steps for Frequent Itemset Generation:
1. Count the support of all size-1 itemsets and keep those that meet the minimum support threshold.
2. Combine the frequent itemsets into larger candidates (size 2, size 3, and so on), counting support and discarding infrequent candidates at each level.
3. Stop when no new frequent itemsets are found.
Example:
Consider the following dataset (5 transactions) with 4 items: {Bread, Butter, Milk, Jam}.
Transaction ID Items Bought
1 Bread, Butter
2 Bread, Milk
3 Bread, Butter, Milk
4 Butter, Milk
5 Bread, Butter, Milk, Jam
1. Generate Itemsets of Size-1 (Individual Items):
Step 1.1: Count occurrences of individual items:
 Bread: Appears in transactions 1, 2, 3, 5 → 4 times.
 Butter: Appears in transactions 1, 3, 4, 5 → 4 times.
 Milk: Appears in transactions 2, 3, 4, 5 → 4 times.
 Jam: Appears in transaction 5 → 1 time.
Step 1.2: Calculate support for each item:
 Support(Bread) = 4/5 = 0.80 (80%)
 Support(Butter) = 4/5 = 0.80 (80%)
 Support(Milk) = 4/5 = 0.80 (80%)
 Support(Jam) = 1/5 = 0.20 (20%)
Step 1.3: Apply Minimum Support Threshold:
If the minimum support threshold is 60% (0.60), then Jam is not a frequent item (since its
support is below 60%), but Bread, Butter, and Milk are frequent items.
2. Generate Itemsets of Size-2 (Pairs of Items):
Now that we have the frequent size-1 itemsets, we generate all possible combinations (itemsets)
of size 2 (pairs of items). Then, we count the occurrences of each pair and calculate their support.
Step 2.1: Count occurrences of pairs:
 Bread, Butter: Appears in transactions 1, 3, 5 → 3 times.
 Bread, Milk: Appears in transactions 2, 3, 5 → 3 times.
 Butter, Milk: Appears in transactions 3, 4, 5 → 3 times.
Step 2.2: Calculate support for each pair:
 Support(Bread, Butter) = 3/5 = 0.60 (60%)
 Support(Bread, Milk) = 3/5 = 0.60 (60%)
 Support(Butter, Milk) = 3/5 = 0.60 (60%)
Step 2.3: Apply Minimum Support Threshold:
Since the support for each of these pairs is 60%, they all meet the minimum support threshold, so
they are all frequent itemsets.
3. Generate Itemsets of Size-3 (Triplets of Items):
Next, we generate itemsets of size 3 by combining frequent size-2 itemsets. We count the
occurrences of each triplet and calculate their support.
Step 3.1: Count occurrences of triplets:
 Bread, Butter, Milk: Appears in transactions 3, 5 → 2 times.
Step 3.2: Calculate support for the triplet:
 Support(Bread, Butter, Milk) = 2/5 = 0.40 (40%)
Step 3.3: Apply Minimum Support Threshold:
The support for {Bread, Butter, Milk} is 40%, which is below the minimum support threshold
of 60%. Therefore, this is not a frequent itemset.
4. Stop When No More Frequent Itemsets Are Found:
At this point, we stop because no new frequent itemsets of size 3 or greater meet the minimum
support threshold.
Final Frequent Itemsets:
 Frequent Itemsets of Size 1: {Bread}, {Butter}, {Milk}
 Frequent Itemsets of Size 2: {Bread, Butter}, {Bread, Milk}, {Butter, Milk}
 Frequent Itemsets of Size 3: None
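The whole level-wise procedure can be mirrored in a few lines of Python. This is a brute-force sketch intended only to reproduce the hand computation above (it enumerates every candidate itemset of each size rather than using Apriori-style candidate generation):

```python
from itertools import combinations

transactions = [
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Bread", "Butter", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Milk", "Jam"},
]
min_support = 0.60
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

frequent = {}
for k in range(1, len(items) + 1):
    level = {c: support(c) for c in combinations(items, k) if support(c) >= min_support}
    if not level:
        break  # anti-monotonicity: an empty level means no larger frequent itemsets exist
    frequent.update(level)

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(set(itemset), f"{s:.0%}")
```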
The Apriori Principle:
Apriori Algorithm is a foundational method in data mining used for discovering frequent
itemsets and generating association rules. Its significance lies in its ability to identify
relationships between items in large datasets which is particularly valuable in market basket
analysis.
Apriori Principle
The Apriori Principle is a key concept in the Apriori Algorithm, which is widely used in
association rule mining to find frequent itemsets in large datasets. It is based on the simple
observation that if an itemset is frequent, then all of its subsets must also be frequent.
In other words, it states:
 If an itemset A is frequent (i.e., meets the minimum support threshold), then all non-
empty subsets of A must also be frequent.
 Conversely, if any subset of an itemset is not frequent, then the entire itemset cannot be
frequent.
This principle helps reduce the search space and computational effort by pruning itemsets that
cannot possibly be frequent.
How the Apriori Principle Works:
 Step 1: Identify all frequent itemsets of size 1 (individual items) based on the minimum
support threshold.
 Step 2: Use the Apriori Principle to generate candidate itemsets of size 2 (pairs of items)
from the frequent itemsets of size 1.
 Step 3: Prune the candidate itemsets by checking if all their subsets are frequent. If any
subset of an itemset is not frequent, prune that itemset.
 Step 4: Repeat the process for higher-order itemsets (size 3, size 4, etc.), always pruning
non-frequent subsets.
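The pruning check in Step 3 can be written as a small helper. This is only a sketch: has_infrequent_subset is an illustrative name, and frequent_prev is assumed to hold the frequent (k-1)-itemsets found at the previous level.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori principle: if any (k-1)-subset of `candidate` is not frequent,
    the candidate itself cannot be frequent and can be pruned."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Example: {Bread, Jam} is pruned because {Jam} is not frequent.
frequent_1 = {frozenset({"Bread"}), frozenset({"Butter"}), frozenset({"Milk"})}
print(has_infrequent_subset(frozenset({"Bread", "Jam"}), frequent_1))     # True
print(has_infrequent_subset(frozenset({"Bread", "Butter"}), frequent_1))  # False
```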
Example to Illustrate the Apriori Principle:
Let's consider a simple dataset of transactions in a store.
Transaction ID Items Bought
1 Bread, Butter
2 Bread, Milk
3 Bread, Butter, Milk
4 Butter, Milk
5 Bread, Butter, Milk, Jam
Step 1: Find Frequent Itemsets of Size 1 (Individual Items)
We first find the frequency of individual items in the dataset.
Item Count Support
Bread 4 4/5 = 0.80
Butter 4 4/5 = 0.80
Milk 4 4/5 = 0.80
Jam 1 1/5 = 0.20
Assume the minimum support threshold is 60% (0.60). The items Bread, Butter, and Milk are
frequent (support ≥ 60%), but Jam is not frequent (support = 20%).
Step 2: Generate Candidate Itemsets of Size 2 (Pairs of Items)
Now, we generate candidate pairs from the frequent items:
 Bread, Butter
 Bread, Milk
 Butter, Milk
Subsets Check (Apriori Principle)
Before calculating the support of the pairs, we apply the Apriori Principle:
 {Bread, Butter}: The subsets are {Bread} and {Butter}, both of which are frequent (from Step
1).
 {Bread, Milk}: The subsets are {Bread} and {Milk}, both of which are frequent.
 {Butter, Milk}: The subsets are {Butter} and {Milk}, both of which are frequent.
Since all subsets of the candidate pairs are frequent, we proceed to count their occurrences.
Step 3: Count Occurrences of Size 2 Itemsets
Count how many transactions contain each of these pairs:
Pair Count Support
Bread, Butter 3 3/5 = 0.60
Bread, Milk 3 3/5 = 0.60
Butter, Milk 3 3/5 = 0.60
All of these pairs meet the minimum support threshold (60%), so they are frequent itemsets.
Step 4: Generate Candidate Itemsets of Size 3 (Triplets of Items)
Next, we generate candidate triplets from the frequent size-2 itemsets:
 Bread, Butter, Milk
Subsets Check (Apriori Principle)
Now, apply the Apriori Principle to the candidate triplet {Bread, Butter, Milk}:
 Subsets of {Bread, Butter, Milk}: {Bread, Butter}, {Bread, Milk}, and {Butter, Milk} — all of
these are frequent from Step 2.
Since all subsets are frequent, we proceed to count the occurrences of the triplet.
Step 5: Count Occurrences of Size 3 Itemsets
Count how many transactions contain the triplet {Bread, Butter, Milk}:
Triplet Count Support
Bread, Butter, Milk 2 2/5 = 0.40
The support of {Bread, Butter, Milk} is 40%, which is below the minimum support threshold
(60%). Therefore, {Bread, Butter, Milk} is not frequent.
Final Frequent Itemsets:
 Size 1 Frequent Itemsets: {Bread}, {Butter}, {Milk}
 Size 2 Frequent Itemsets: {Bread, Butter}, {Bread, Milk}, {Butter, Milk}
 Size 3 Frequent Itemsets: None (because the triplet didn’t meet the support threshold)
The Apriori Principle helps us efficiently reduce the number of candidate itemsets by pruning
non-frequent subsets.
In this example, we used the principle to prune the candidate triplet {Bread, Butter, Milk} early
in the process, saving computational effort.
The frequent itemsets identified were those with support ≥ 60%, and we stop further exploration
once no more frequent itemsets can be generated.
Apriori Algorithm:
The Apriori Algorithm is one of the most widely used algorithms in data mining, specifically
for association rule mining. It is used to find frequent itemsets in large datasets and derive
association rules from them. The Apriori Algorithm operates on the principle that if an itemset
is frequent, all of its subsets must also be frequent. This principle is used to reduce the number of
candidate itemsets, making the algorithm more efficient.
Steps of the Apriori Algorithm:
The Apriori algorithm follows a bottom-up approach to find frequent itemsets in a transaction
database. It works in multiple steps:
1. Generate Candidate Itemsets: Start by finding frequent itemsets of size 1 (individual
items) and then extend them to size 2, size 3, etc., iteratively.
2. Prune Non-Frequent Itemsets: At each step, use the Apriori principle to prune
itemsets that have subsets that are not frequent.
3. Generate Association Rules: Once frequent itemsets are found, generate association
rules with high confidence and evaluate their usefulness.
Apriori Algorithm – Step-by-Step:
Step 1: Identify Frequent Itemsets of Size 1
 Scan the transactions, count the support of each individual item, and keep the items whose
support meets the minimum support threshold.
Step 2: Generate Candidate Itemsets of Size 2
 Combine frequent items from step 1 to generate candidate itemsets of size 2 (pairs of
items).
 For example, if {Bread} and {Butter} are frequent, the candidate itemset is {Bread,
Butter}.
 Count the support for each of the candidate itemsets by scanning the transactions again.
Step 3: Prune Non-Frequent Itemsets
 After generating candidate itemsets, prune the ones whose subsets are not frequent
(according to the Apriori principle).
 If an itemset {A, B} is frequent but any of its subsets, such as {A} or {B}, is not frequent,
then {A, B} cannot be frequent.
Step 4: Repeat for Larger Itemsets
 Once you find the frequent itemsets of size 2, move on to size 3, size 4, and so on.
Continue combining frequent itemsets of size k to generate candidate itemsets of size
k+1.
 Repeat the process of counting support and pruning non-frequent itemsets for each new
level of itemsets until no more frequent itemsets can be found.
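Steps 2–4 amount to one generate-and-prune iteration per level. The sketch below shows one such iteration in plain Python; apriori_gen is an illustrative helper name, the join is done by unioning pairs of frequent k-itemsets, and the prune step applies the Apriori principle.

```python
from itertools import combinations

def apriori_gen(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune any
    candidate that has an infrequent k-subset (the Apriori principle)."""
    candidates = {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}
    return {c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}

# Frequent 1-itemsets from the running example (minimum support 60%).
L1 = {frozenset({"Bread"}), frozenset({"Butter"}), frozenset({"Milk"})}
for candidate in sorted(apriori_gen(L1, 1), key=sorted):
    print(set(candidate))   # the three candidate pairs
```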
Step 5: Generate Association Rules
 From each frequent itemset, generate rules of the form A → B and keep the rules whose
confidence meets the minimum confidence threshold.
Example:
Consider a small dataset of 5 transactions with 4 items:
Transaction ID Items Bought
1 Bread, Butter
2 Bread, Milk
3 Bread, Butter, Milk
4 Butter, Milk
5 Bread, Butter, Milk, Jam
Step 1: Find Frequent Itemsets of Size 1
Item Support
Bread 4/5 = 0.80
Butter 4/5 = 0.80
Milk 4/5 = 0.80
Jam 1/5 = 0.20
 Minimum support threshold = 60% (0.60)
 Frequent itemsets of size 1: {Bread}, {Butter}, {Milk}
Step 2: Generate Candidate Itemsets of Size 2
Candidate pairs from the frequent size-1 itemsets:
 {Bread, Butter}
 {Bread, Milk}
 {Butter, Milk}
Support counts:
Itemset Support
Bread, Butter 3/5 = 0.60
Bread, Milk 3/5 = 0.60
Butter, Milk 3/5 = 0.60
 All pairs are frequent because their support is ≥ 60%.
Step 3: Generate Candidate Itemsets of Size 3
 Candidate triplet: {Bread, Butter, Milk}
Support count:
Itemset Support
Bread, Butter, Milk 2/5 = 0.40
 The support of {Bread, Butter, Milk} is 40%, which is less than the minimum support threshold
of 60%, so this itemset is not frequent.
Step 4: Generate Association Rules
 Generate rules from the frequent itemsets of size 2.
For example, from {Bread, Butter}, the association rules are:
 Bread → Butter with confidence = 3/4 = 0.75
 Butter → Bread with confidence = 3/4 = 0.75
Similarly, from {Bread, Milk}:
 Bread → Milk with confidence = 3/4 = 0.75
 Milk → Bread with confidence = 3/4 = 0.75
And from {Butter, Milk}:
 Butter → Milk with confidence = 3/4 = 0.75
 Milk → Butter with confidence = 3/4 = 0.75
Final Frequent Itemsets and Association Rules:
 Frequent Itemsets: {Bread}, {Butter}, {Milk}, {Bread, Butter}, {Bread, Milk}, {Butter, Milk}
 Association Rules:
o Bread → Butter (Confidence: 75%)
o Butter → Bread (Confidence: 75%)
o Bread → Milk (Confidence: 75%)
o Milk → Bread (Confidence: 75%)
o Butter → Milk (Confidence: 75%)
o Milk → Butter (Confidence: 75%)
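The rule-generation step can be sketched as well. The snippet assumes the supports already computed in this example, uses an illustrative minimum confidence of 70%, and only handles rules with a single-item antecedent and consequent, which is all this example needs.

```python
# Supports taken from the worked example above.
support = {
    frozenset({"Bread"}): 0.8,
    frozenset({"Butter"}): 0.8,
    frozenset({"Milk"}): 0.8,
    frozenset({"Bread", "Butter"}): 0.6,
    frozenset({"Bread", "Milk"}): 0.6,
    frozenset({"Butter", "Milk"}): 0.6,
}
min_confidence = 0.70  # illustrative threshold; the text above does not fix one

for pair in [p for p in support if len(p) == 2]:
    for antecedent in pair:
        consequent = next(iter(pair - {antecedent}))
        conf = support[pair] / support[frozenset({antecedent})]
        if conf >= min_confidence:
            print(f"{antecedent} -> {consequent}  (confidence {conf:.0%})")
```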
Advantages of the Apriori Algorithm:
 Simplicity: The algorithm is simple to understand and implement.
 Pruning: It reduces the search space using the Apriori principle to prune non-frequent itemsets.
 Efficiency: It can handle large datasets and is widely used for market basket analysis and similar
tasks.
Disadvantages of the Apriori Algorithm:
 Combinatorial Explosion: The algorithm can suffer from combinatorial explosion when dealing
with large datasets because it generates a large number of candidate itemsets.
 Multiple Scans of Dataset: The algorithm requires multiple scans of the dataset (one per level of
itemsets), which can be inefficient for very large datasets.
FP-Growth Algorithm :
The FP-Growth (Frequent Pattern Growth) algorithm is an efficient and scalable method for
mining frequent itemsets in large datasets. Unlike the Apriori algorithm, which generates
candidate itemsets and uses multiple database scans, FP-Growth works by building a compact
data structure called the FP-tree and then recursively mining the frequent itemsets directly
from this tree. The FP-Growth algorithm is faster and more efficient because it avoids candidate
generation and the repetitive database scanning done in Apriori.
How FP-Growth Works:
FP-Growth uses a divide-and-conquer approach and can be broken down into two major steps:
1. Step 1: Construct the FP-Tree
2. Step 2: Mine the FP-Tree for Frequent Itemsets
Step 1: Constructing the FP-Tree
The FP-Tree (Frequent Pattern Tree) is a compact data structure that represents the transactional
database. It stores the frequent itemsets in a compressed format. Here’s how it is constructed:
1. Scan the Dataset: The algorithm scans the dataset to calculate the support of all items
and identifies the frequent items (those with support above the minimum threshold).
2. Sort Items: The frequent items are sorted in descending order of their support, and each
transaction is rewritten by only keeping the frequent items (excluding infrequent ones).
3. Build the FP-Tree:
o The FP-Tree starts with a root node (denoted as “null”).
o For each transaction, the items are placed in the tree in the order of their frequency (from
the most frequent to least frequent), and a path is created.
o Each time a new transaction shares a prefix (starting sequence of items) with an
existing path, the counts of the shared nodes are incremented, and new nodes are added
only for the remaining items.
o If a transaction doesn't share any existing path, a new branch is created in the tree.
o The nodes in the tree represent items, and edges represent the co-occurrence between
items in the transaction dataset.
4. Store Item-Conditional Patterns: For each frequent item, FP-Growth builds a
conditional pattern base (a sub-database consisting of transactions that contain the
item), and a corresponding conditional FP-tree is constructed.
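A minimal sketch of the tree-building step is shown below, assuming a very simple node type; the class and function names are illustrative, and a real implementation would also maintain a header table with node links for each item, which is omitted here.

```python
class FPNode:
    def __init__(self, item=None):
        self.item = item        # None for the root ("null") node
        self.count = 0
        self.children = {}      # item -> FPNode

def insert(root, sorted_items):
    """Insert one transaction (already filtered to frequent items and
    sorted by descending item support) into the FP-tree."""
    node = root
    for item in sorted_items:
        if item not in node.children:
            node.children[item] = FPNode(item)
        node = node.children[item]
        node.count += 1

root = FPNode()
for t in [["Bread", "Milk", "Butter"], ["Bread", "Milk"], ["Bread", "Butter"],
          ["Milk", "Butter"], ["Bread", "Milk", "Butter"]]:
    insert(root, t)

print(root.children["Bread"].count)                   # 4
print(root.children["Bread"].children["Milk"].count)  # 3
```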
Step 2: Mining the FP-Tree for Frequent Itemsets
Once the FP-Tree is constructed, the next step is to mine the frequent itemsets from the tree. This
is done by recursively traversing the FP-Tree:
1. Start from the bottom (least frequent items): Mining proceeds item by item, starting with
the items lowest in the frequency ordering (the leaves of the tree) and working upward.
2. Conditional Pattern Base: For each frequent item, construct its conditional pattern
base by gathering all paths in the tree that end with the item. These paths form the
"context" in which the item appears.
3. Recursive Mining: Recursively mine the conditional pattern base (sub-tree) for frequent
itemsets. For each frequent item, create a conditional FP-tree and repeat the process.
The process continues recursively until all frequent itemsets have been discovered.
Advantages of FP-Growth over Apriori:
1. Efficiency: FP-Growth avoids generating candidate itemsets and performs fewer database scans
(only two passes are needed).
2. Compactness: FP-Growth uses a tree structure, which is more compact and reduces memory
consumption compared to storing all transactions.
3. Scalability: It can handle larger datasets more efficiently than the Apriori algorithm.
4. No Candidate Generation: Unlike Apriori, FP-Growth does not require candidate generation,
which can be computationally expensive and lead to combinatorial explosion.
Example of FP-Growth Algorithm:
Let’s take a small dataset of transactions to illustrate how FP-Growth works.
Dataset:
Transaction ID Items Bought
1 Bread, Milk, Butter
2 Bread, Milk
3 Bread, Butter
4 Milk, Butter
5 Bread, Milk, Butter
Step 1: Construct the FP-Tree
1. First Scan: Count the frequency of each item:
Item Support (Frequency)
Bread 4
Milk 4
Butter 4
 Frequent items: All items (Bread, Milk, Butter) have a support of 80%, which is above the
minimum support threshold (60%).
2. Second Scan: Sort the frequent items in descending order of support. In this case, they all have
the same support, so we can choose any order. Let’s say we order them as: Bread, Milk, Butter.
 Now we rewrite the transactions by only including frequent items and sorting them according to
the chosen order.
Transaction ID Sorted Items
1 Bread, Milk, Butter
2 Bread, Milk
3 Bread, Butter
4 Milk, Butter
5 Bread, Milk, Butter
3. Construct the FP-Tree:
o We start with the root node (null).
o For Transaction 1: Bread → Milk → Butter (create a new path from the root).
o For Transaction 2: Bread → Milk (shares the prefix Bread → Milk, so those node counts are incremented).
o For Transaction 3: Bread → Butter (shares only the Bread prefix; a new Butter node is added under Bread).
o For Transaction 4: Milk → Butter (no shared prefix; a new branch is created from the root).
o For Transaction 5: Bread → Milk → Butter (same path as Transaction 1; the counts along that path are incremented).
The resulting FP-Tree would look something like this:
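(each node is shown as item:count)
null
├── Bread:4
│   ├── Milk:3
│   │   └── Butter:2
│   └── Butter:1
└── Milk:1
    └── Butter:1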
Step 2: Mine the FP-Tree
1. Start with Butter: The conditional pattern base for Butter is the set of prefix paths that lead to
Butter in the tree:
o {Bread, Milk} → support count = 2
o {Bread} → support count = 1
o {Milk} → support count = 1
o Conditional FP-Tree for Butter: Bread → support = 3, Milk → support = 3 (each meets the
minimum count of 3), but {Bread, Milk} together reaches only 2 and is pruned.
o Frequent itemsets containing Butter: {Butter}, {Bread, Butter}, {Milk, Butter}
2. Next, Milk: The conditional pattern base for Milk is:
o {Bread} → support count = 3
o Conditional FP-Tree for Milk: {Bread} → support = 3
o Frequent itemsets containing Milk: {Milk}, {Bread, Milk}
3. Finally, Bread: Bread is always the first item on a path, so its conditional pattern base is empty.
o Frequent itemset: {Bread}
Final Frequent Itemsets (minimum support 60%):
 {Bread}
 {Milk}
 {Butter}
 {Bread, Milk}
 {Bread, Butter}
 {Milk, Butter}
The triplet {Bread, Milk, Butter} appears in only 2 of the 5 transactions (support 40%), so it is not frequent.
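To make the mining step concrete, the conditional pattern base of an item can be read off the tree's root-to-leaf paths. The sketch below hard-codes those paths (with the number of transactions ending at each leaf) and extracts the prefix paths for Butter; it is an illustration of the idea, not a full FP-Growth implementation.

```python
# Root-to-leaf paths of the example FP-tree, with the number of
# transactions that end at each leaf.
paths = [
    (["Bread", "Milk", "Butter"], 2),   # transactions 1 and 5
    (["Bread", "Milk"], 1),             # transaction 2
    (["Bread", "Butter"], 1),           # transaction 3
    (["Milk", "Butter"], 1),            # transaction 4
]

def conditional_pattern_base(item):
    """Prefix paths (and counts) of every path that contains `item`."""
    base = []
    for path, count in paths:
        if item in path:
            prefix = path[:path.index(item)]
            if prefix:
                base.append((prefix, count))
    return base

print(conditional_pattern_base("Butter"))
# [(['Bread', 'Milk'], 2), (['Bread'], 1), (['Milk'], 1)]
```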