Frequent Pattern Analysis, Apriori and FP Growth Algorithm
1. Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No.8275032712
Email- shivarkarsandipcomp@sanjivani.org.in
Subject- Data Mining and Warehousing (CO314)
Unit –V: Frequent Pattern Analysis
2. Content
Market Basket Analysis, Frequent item set, closed item set & Association
Rules, mining multilevel association rules, constraint based association rule
mining
Generating Association Rules from Frequent Item sets, Apriori Algorithm,
Improving the Efficiency of Apriori, FP Growth Algorithm
Mining Various Kinds of Association Rules: Mining multilevel association
rules, constraint based association rule mining, Meta rule-Guided Mining of
Association Rules.
3. What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
4. Why is Frequent Pattern Analysis Important?
Freq. pattern: An intrinsic and important property of
datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
5. Basic Concepts: Frequent Patterns
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
itemset: a set of one or more items
k-itemset: X = {x1, …, xk}
(absolute) support, or support count, of X: the number of transactions in which itemset X occurs
(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X’s support is no less than a minsup threshold
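The support definitions above can be checked directly on the five-transaction table (a minimal Python sketch; the function names are illustrative, not from any library):

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Beer", "Diaper"}, transactions))  # 3
print(support({"Beer", "Diaper"}, transactions))        # 0.6
```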
6. Basic Concepts: Association Rules
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (support 60%, confidence 100%)
Diaper → Beer (support 60%, confidence 75%)
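The two rule measures can be reproduced on the same five transactions (an illustrative sketch; `rule_metrics` is a made-up helper name, not a library function):

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y:
    support = P(X ∪ Y), confidence = P(Y | X)."""
    n = len(transactions)
    both = sum(1 for t in transactions if X | Y <= t)
    ante = sum(1 for t in transactions if X <= t)
    return both / n, both / ante

print(rule_metrics({"Beer"}, {"Diaper"}, transactions))   # (0.6, 1.0)
print(rule_metrics({"Diaper"}, {"Beer"}, transactions))   # (0.6, 0.75)
```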
7. The Downward Closure Property and Scalable Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains
{beer, diaper}
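The downward closure property itself is easy to state as a check (a sketch; `all_subsets_frequent` is a hypothetical helper):

```python
from itertools import combinations

def all_subsets_frequent(itemset, frequent_sets):
    """Downward closure: every non-empty proper subset of a frequent
    itemset must itself be frequent."""
    return all(frozenset(sub) in frequent_sets
               for r in range(1, len(itemset))
               for sub in combinations(itemset, r))

frequent = {frozenset({"Beer"}), frozenset({"Diaper"}),
            frozenset({"Beer", "Diaper"})}
print(all_subsets_frequent(frozenset({"Beer", "Diaper"}), frequent))  # True
```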
Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant@VLDB’94)
Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
8. Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: If there is any itemset which is infrequent, its
superset should not be generated/tested! (Agrawal & Srikant
@VLDB’94, Mannila, et al. @ KDD’ 94)
Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length k frequent
itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be generated
11. The Apriori Algorithm—An Example 3
A database has five transactions. Let min_sup = 60% and min_conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
(b) List all the strong association rules with support s and confidence c
12. The Apriori Algorithm Pseudo-Code
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are contained in t
  Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
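The pseudo-code above can be rendered in Python roughly as follows (a simplified, set-based sketch, not an optimized implementation; on the five-transaction example with absolute min_support 3 it finds Beer, Nuts, Diaper, Eggs, and {Beer, Diaper}):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise candidate generation and test, following the
    pseudo-code above (a simplified sketch)."""
    def support_count(c):
        return sum(1 for t in transactions if c <= t)

    items = {frozenset([i]) for t in transactions for i in t}
    L = {c for c in items if support_count(c) >= min_count}  # L1
    all_frequent = set(L)
    k = 1
    while L:
        # Self-join: unions of frequent k-itemsets that form (k+1)-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune candidates with an infrequent k-subset, then test against DB
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c for c in candidates if support_count(c) >= min_count}
        all_frequent |= L
        k += 1
    return all_frequent

db = [frozenset(t) for t in [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]]
result = apriori(db, 3)  # frequent itemsets with absolute support >= 3
```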
13. Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
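The self-join and pruning steps on this L3 example can be sketched as follows (here the join is a plain set union of size k+1, a simplification of the lexicographic self-join; after pruning it yields the same C4):

```python
from itertools import combinations

def gen_candidates(Lk, k):
    """Self-join Lk with itself, then prune every candidate that has
    an infrequent k-subset (the Apriori pruning step)."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
C4 = gen_candidates(L3, 3)
print(C4 == {frozenset("abcd")})  # True: acde is pruned since ade is not in L3
```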
14. Further Challenges and Improvement in Apriori
Major computational challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
15. Improving the Efficiency of Apriori
Hash-based technique
Hashing itemsets into corresponding buckets
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck,
for k > 1
Transaction reduction
Reducing the number of transactions scanned in future iterations
A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
Therefore, such a transaction can be marked or removed from further consideration
because subsequent database scans for j-itemsets, where j > k, will not need to
consider such a transaction.
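Transaction reduction can be sketched as a simple filter over the database (hypothetical helper name; on the running example with L2 = {{Beer, Diaper}}, only the first three transactions survive):

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Drop transactions containing no frequent k-itemset; they cannot
    contribute to any frequent (k+1)-itemset in later scans."""
    return [t for t in transactions
            if any(c <= t for c in frequent_k_itemsets)]

db = [frozenset(t) for t in [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]]
L2 = {frozenset({"Beer", "Diaper"})}
reduced = reduce_transactions(db, L2)  # only the first three transactions remain
```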
16. Improving the Efficiency of Apriori
Partitioning
Partitioning the data to find candidate itemsets
Sampling
Mining on a subset of the given data
The basic idea of the sampling approach is to pick a random sample S of the given data
D, and then search for frequent itemsets in S instead of D. In this way, we trade off
some degree of accuracy against efficiency
Dynamic itemset counting
Adding candidate itemsets at different points during a scan
17. Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FP Growth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc → “abcd” is a frequent pattern
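The projected-database idea can be sketched without the FP-tree structure itself (a plain recursive projection; a fixed item order is assumed so each pattern is enumerated exactly once, whereas real FP-growth orders items by descending frequency and uses a compact tree):

```python
from collections import Counter

def pattern_growth(db, min_count, suffix=frozenset()):
    """Grow long patterns from short ones using projected databases
    (the FP-growth philosophy, sketched without the FP-tree)."""
    patterns = {}
    counts = Counter(i for t in db for i in t)
    for item in sorted(counts):          # fixed item order
        if counts[item] < min_count:
            continue
        pattern = suffix | {item}
        patterns[pattern] = counts[item]
        # Project the DB on `item`, keeping only items preceding it
        projected = [frozenset(i for i in t if i < item)
                     for t in db if item in t]
        projected = [t for t in projected if t]
        patterns.update(pattern_growth(projected, min_count, pattern))
    return patterns

db = [frozenset(t) for t in [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]]
result = pattern_growth(db, 3)  # same frequent itemsets Apriori finds
```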
23. Mining Various Kinds of Association Rules: Mining
multilevel association rules
Pattern Mining in Multilevel, Multidimensional Space
Multilevel associations involve concepts at different abstraction levels.
Multidimensional associations involve more than one dimension or
predicate (e.g., rules that relate what a customer buys to his or her
age).
Quantitative association rules involve numeric attributes that have an
implicit ordering among values (e.g., age).
24. Mining Multilevel Associations
For many applications, strong associations discovered at high abstraction levels, though they have high support, often correspond to commonsense knowledge.
We may want to drill down to find novel patterns at more detailed levels.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts
to a higher-level, more general concept set
Data can be generalized by replacing low-level concepts within the data by their
corresponding higher-level concepts, or ancestors, from a concept hierarchy.
Association rules generated from mining data at multiple abstraction levels are called
multiple-level or multilevel association
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
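Generalizing data through a concept hierarchy before mining can be sketched as follows (the hierarchy and item names here are hypothetical; any frequent-itemset miner can then run on the generalized transactions):

```python
# Hypothetical concept hierarchy mapping low-level items to ancestors
hierarchy = {
    "IBM laptop": "laptop", "Dell laptop": "laptop",
    "b/w printer": "printer", "color printer": "printer",
}

def generalize(transactions, hierarchy):
    """Replace each low-level item by its higher-level ancestor,
    lifting the data to the higher abstraction level."""
    return [frozenset(hierarchy.get(i, i) for i in t) for t in transactions]

txns = [{"IBM laptop", "b/w printer"}, {"Dell laptop", "color printer"}]
generalized = generalize(txns, hierarchy)
# both transactions generalize to {"laptop", "printer"}
```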
28. Constraint-Based Frequent Pattern Mining
Constraint based association rule mining aims to develop a systematic method
by which the user can find important association among items in a database of
transactions.
The users specify intuition or expectations as constraints to confine the search
space.
This strategy is known as constraint-based mining. The constraints can include
the following:
Knowledge type constraints
Data constraints
Interestingness constraints
Rule constraints
29. Constraint-Based Frequent Pattern Mining
Knowledge type constraints: These specify the type of knowledge to be
mined, such as association, correlation, classification, or clustering.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or
attributes) of the data, the abstraction levels, or the level of the concept
hierarchies to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of
rule interestingness such as support, confidence, and correlation.
30. Constraint-Based Frequent Pattern Mining
Rule constraints: These specify the form of, or conditions on, the rules to be
mined. Such constraints may be expressed as meta rules (rule templates), as
the maximum or minimum number of predicates that can occur in the rule
antecedent or consequent, or as relationships among attributes, attribute
values, and/or aggregates.
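Applying such a rule constraint after mining can be sketched as a filter (the rule tuple format and the `passes` helper are assumptions for illustration, not part of any standard API):

```python
def passes(rule, required_items, min_sup, min_conf):
    """Keep a rule only if its antecedent contains the required items
    and it meets the interestingness thresholds."""
    antecedent, consequent, sup, conf = rule
    return required_items <= antecedent and sup >= min_sup and conf >= min_conf

rules = [
    (frozenset({"Diaper"}), frozenset({"Beer"}), 0.60, 0.75),
    (frozenset({"Nuts"}), frozenset({"Milk"}), 0.40, 0.66),
]
kept = [r for r in rules if passes(r, frozenset({"Diaper"}), 0.5, 0.7)]
print(len(kept))  # 1
```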
31. Reference
Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier, ISBN: 9780123814791, 9780123814807.
https://onlinecourses.nptel.ac.in/noc24_cs22