Frequent Pattern Analysis, Apriori and FP Growth Algorithm
1. Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No.8275032712
Email- shivarkarsandipcomp@sanjivani.org.in
Subject- Data Mining and Warehousing (CO314)
Unit –V: Frequent Pattern Analysis
2. Content
Market Basket Analysis, Frequent item set, closed item set & Association
Rules, mining multilevel association rules, constraint based association rule
mining
Generating Association Rules from Frequent Item sets, Apriori Algorithm,
Improving the Efficiency of Apriori, FP Growth Algorithm
Mining Various Kinds of Association Rules: Mining multilevel association
rules, constraint based association rule mining, Meta rule-Guided Mining of
Association Rules.
3. What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
4. Why is Frequent Pattern Analysis Important?
Freq. pattern: An intrinsic and important property of
datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
5. Basic Concepts: Frequent Patterns
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
itemset: a set of one or more items
k-itemset: X = {x1, …, xk}
(absolute) support, or support count, of X: the number of transactions in which itemset X occurs
(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X’s support is no less than a minsup threshold
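The support definitions above can be checked directly on the five-transaction table (a minimal Python sketch; the function names are illustrative, not from any library):

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Beer", "Diaper"}, transactions))  # 3
print(support({"Beer", "Diaper"}, transactions))        # 0.6
```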
6. Basic Concepts: Association Rules
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (support 60%, confidence 100%)
Diaper → Beer (support 60%, confidence 75%)
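The two rule measures can be reproduced on the same five transactions (an illustrative sketch; `rule_metrics` is a made-up helper name, not a library function):

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y:
    support = P(X ∪ Y), confidence = P(Y | X)."""
    n = len(transactions)
    both = sum(1 for t in transactions if X | Y <= t)
    ante = sum(1 for t in transactions if X <= t)
    return both / n, both / ante

print(rule_metrics({"Beer"}, {"Diaper"}, transactions))   # (0.6, 1.0)
print(rule_metrics({"Diaper"}, {"Beer"}, transactions))   # (0.6, 0.75)
```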
7. The Downward Closure Property and Scalable Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains
{beer, diaper}
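The downward closure property itself is easy to state as a check (a sketch; `all_subsets_frequent` is a hypothetical helper):

```python
from itertools import combinations

def all_subsets_frequent(itemset, frequent_sets):
    """Downward closure: every non-empty proper subset of a frequent
    itemset must itself be frequent."""
    return all(frozenset(sub) in frequent_sets
               for r in range(1, len(itemset))
               for sub in combinations(itemset, r))

frequent = {frozenset({"Beer"}), frozenset({"Diaper"}),
            frozenset({"Beer", "Diaper"})}
print(all_subsets_frequent(frozenset({"Beer", "Diaper"}), frequent))  # True
```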
Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant@VLDB’94)
Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
8. Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: If there is any itemset which is infrequent, its
superset should not be generated/tested! (Agrawal & Srikant
@VLDB’94, Mannila, et al. @ KDD’ 94)
Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length k frequent
itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be generated
11. The Apriori Algorithm—An Example 3
A database has five transactions. Let min_sup = 60% and min_conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
(b) List all the strong association rules with support s and confidence c
12. The Apriori Algorithm Pseudo-Code
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are contained in t
  Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
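The pseudo-code above can be rendered in Python roughly as follows (a simplified, set-based sketch, not an optimized implementation; on the five-transaction example with absolute min_support 3 it finds Beer, Nuts, Diaper, Eggs, and {Beer, Diaper}):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise candidate generation and test, following the
    pseudo-code above (a simplified sketch)."""
    def support_count(c):
        return sum(1 for t in transactions if c <= t)

    items = {frozenset([i]) for t in transactions for i in t}
    L = {c for c in items if support_count(c) >= min_count}  # L1
    all_frequent = set(L)
    k = 1
    while L:
        # Self-join: unions of frequent k-itemsets that form (k+1)-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune candidates with an infrequent k-subset, then test against DB
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c for c in candidates if support_count(c) >= min_count}
        all_frequent |= L
        k += 1
    return all_frequent

db = [frozenset(t) for t in [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]]
result = apriori(db, 3)  # frequent itemsets with absolute support >= 3
```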
13. Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
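The self-join and pruning steps on this L3 example can be sketched as follows (here the join is a plain set union of size k+1, a simplification of the lexicographic self-join; after pruning it yields the same C4):

```python
from itertools import combinations

def gen_candidates(Lk, k):
    """Self-join Lk with itself, then prune every candidate that has
    an infrequent k-subset (the Apriori pruning step)."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
C4 = gen_candidates(L3, 3)
print(C4 == {frozenset("abcd")})  # True: acde is pruned since ade is not in L3
```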
14. Further Challenges and Improvement in Apriori
Major computational challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
15. Improving the Efficiency of Apriori
Hash-based technique
Hashing itemsets into corresponding buckets
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck,
for k > 1
Transaction reduction
Reducing the number of transactions scanned in future iterations
A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
Therefore, such a transaction can be marked or removed from further consideration
because subsequent database scans for j-itemsets, where j > k, will not need to
consider such a transaction.
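Transaction reduction can be sketched as a simple filter over the database (hypothetical helper name; on the running example with L2 = {{Beer, Diaper}}, only the first three transactions survive):

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Drop transactions containing no frequent k-itemset; they cannot
    contribute to any frequent (k+1)-itemset in later scans."""
    return [t for t in transactions
            if any(c <= t for c in frequent_k_itemsets)]

db = [frozenset(t) for t in [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]]
L2 = {frozenset({"Beer", "Diaper"})}
reduced = reduce_transactions(db, L2)  # only the first three transactions remain
```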
16. Improving the Efficiency of Apriori
Partitioning
Partitioning the data to find candidate itemsets
Sampling
Mining on a subset of the given data
The basic idea of the sampling approach is to pick a random sample S of the given data
D, and then search for frequent itemsets in S instead of D. In this way, we trade off
some degree of accuracy against efficiency
Dynamic itemset counting
Adding candidate itemsets at different points during a scan
17. Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FP Growth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc → “abcd” is a frequent pattern
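The projected-database idea can be sketched without the FP-tree structure itself (a plain recursive projection; a fixed item order is assumed so each pattern is enumerated exactly once, whereas real FP-growth orders items by descending frequency and uses a compact tree):

```python
from collections import Counter

def pattern_growth(db, min_count, suffix=frozenset()):
    """Grow long patterns from short ones using projected databases
    (the FP-growth philosophy, sketched without the FP-tree)."""
    patterns = {}
    counts = Counter(i for t in db for i in t)
    for item in sorted(counts):          # fixed item order
        if counts[item] < min_count:
            continue
        pattern = suffix | {item}
        patterns[pattern] = counts[item]
        # Project the DB on `item`, keeping only items preceding it
        projected = [frozenset(i for i in t if i < item)
                     for t in db if item in t]
        projected = [t for t in projected if t]
        patterns.update(pattern_growth(projected, min_count, pattern))
    return patterns

db = [frozenset(t) for t in [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]]
result = pattern_growth(db, 3)  # same frequent itemsets Apriori finds
```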
23. Mining Various Kinds of Association Rules: Mining
multilevel association rules
Pattern Mining in Multilevel, Multidimensional Space
Multilevel associations involve concepts at different abstraction levels.
Multidimensional associations involve more than one dimension or
predicate (e.g., rules that relate what a customer buys to his or her
age).
Quantitative association rules involve numeric attributes that have an
implicit ordering among values (e.g., age).
24. Mining Multilevel Associations
For many applications, strong associations discovered at high abstraction levels, though they have high support, often correspond to commonsense knowledge.
We may want to drill down to find novel patterns at more detailed levels.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts
to a higher-level, more general concept set
Data can be generalized by replacing low-level concepts within the data by their
corresponding higher-level concepts, or ancestors, from a concept hierarchy.
Association rules generated from mining data at multiple abstraction levels are called
multiple-level or multilevel association
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
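Generalizing data through a concept hierarchy before mining can be sketched as follows (the hierarchy and item names here are hypothetical; any frequent-itemset miner can then run on the generalized transactions):

```python
# Hypothetical concept hierarchy mapping low-level items to ancestors
hierarchy = {
    "IBM laptop": "laptop", "Dell laptop": "laptop",
    "b/w printer": "printer", "color printer": "printer",
}

def generalize(transactions, hierarchy):
    """Replace each low-level item by its higher-level ancestor,
    lifting the data to the higher abstraction level."""
    return [frozenset(hierarchy.get(i, i) for i in t) for t in transactions]

txns = [{"IBM laptop", "b/w printer"}, {"Dell laptop", "color printer"}]
generalized = generalize(txns, hierarchy)
# both transactions generalize to {"laptop", "printer"}
```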
28. Constraint-Based Frequent Pattern Mining
Constraint based association rule mining aims to develop a systematic method
by which the user can find important association among items in a database of
transactions.
The users specify intuition or expectations as constraints to confine the search
space.
This strategy is known as constraint-based mining. The constraints can include
the following:
Knowledge type constraints
Data constraints
Interestingness constraints
Rule constraints
29. Constraint-Based Frequent Pattern Mining
Knowledge type constraints: These specify the type of knowledge to be
mined, such as association, correlation, classification, or clustering.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or
attributes) of the data, the abstraction levels, or the level of the concept
hierarchies to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of
rule interestingness such as support, confidence, and correlation.
30. Constraint-Based Frequent Pattern Mining
Rule constraints: These specify the form of, or conditions on, the rules to be
mined. Such constraints may be expressed as meta rules (rule templates), as
the maximum or minimum number of predicates that can occur in the rule
antecedent or consequent, or as relationships among attributes, attribute
values, and/or aggregates.
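Applying such a rule constraint after mining can be sketched as a filter (the rule tuple format and the `passes` helper are assumptions for illustration, not part of any standard API):

```python
def passes(rule, required_items, min_sup, min_conf):
    """Keep a rule only if its antecedent contains the required items
    and it meets the interestingness thresholds."""
    antecedent, consequent, sup, conf = rule
    return required_items <= antecedent and sup >= min_sup and conf >= min_conf

rules = [
    (frozenset({"Diaper"}), frozenset({"Beer"}), 0.60, 0.75),
    (frozenset({"Nuts"}), frozenset({"Milk"}), 0.40, 0.66),
]
kept = [r for r in rules if passes(r, frozenset({"Diaper"}), 0.5, 0.7)]
print(len(kept))  # 1
```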
31. Reference
Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier, ISBN: 9780123814791, 9780123814807.
https://onlinecourses.nptel.ac.in/noc24_cs22