6 module 4

Module - 4
By
Dr.Ramkumar.T
ramkumar.thirunavukarasu@vit.ac.in
Mining Frequent Patterns, Associations and Correlations

2
What Is Frequent Pattern
Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis,
Web log (click stream) analysis, and DNA sequence analysis.

Association Rule Mining
• Proposed by Agrawal R, Imielinski T, and Swami AN –
"Mining Association Rules between Sets of Items in Large
Databases.‖ – SIGMOD, June 1993
• An important data mining model studied extensively by the
database and data mining community
• Initially used for Market Basket Analysis to find how items
purchased by customers are related
• Assumes all data are categorical
• Not a good algorithm for numeric data

Basic Terms
• An item - an item/article in a basket
• Itemset – Collection of one or more items
I = {i1, i2, …, im}
• k-itemset - An itemset that contains k-items
{Bread} - 1 Itemset
{Bread,Milk} – 2 Itemset
{Bread,Milk, Diaper} – 3 Itemset

Basic Terms
• Transaction (T) - items purchased in a basket, Contains set of
items
• A transactional dataset - A set of transactions, Contains
{T1, T2,…, Tn}

Association Rule Measures
• An Association Rule is an implication of the form X  Y,
where X, Y belong to ‗ I ‘
• Example: {Milk, Diaper} → {Beer}
• Objective measures - A data-driven approach for evaluating the
quality of association patterns. It is domain-independent and
requires minimal input from the users
• Subjective measure – Knowledge acquired through domain
expertise
• Widely used objective measures in ARM : Support &
Confidence

Support (s)
• Support measures the frequency of the item/itemset
• The support ‗s‘ of an itemset ‗X‘ is the percentage of transactions in the
transaction database D that contain ‗X‘
• The support ‗s‘ of the rule ‗X  Y‘ in the transaction database ‗D‘ is the
support of the it ems set

Association Rule Mining
Given:
- a set ―I‖ of all the items;
- A database ―D‖ of transactions;
- minimum support – ‗s‘
- minimum confidence – ‗c‘
Find :
- All association rules X  Y that satisfy the
minimum support ‗s‘ and minimum confidence
‗c‘
How?
- Step 1: Frequent Item set Generation - Find all sets of
items that have minimum support (frequent item
sets)
- Step 2: Rule Generation - Generate association rules
that satisfy the minimum confidence thresholds from
the frequent item sets found in the previous step.

Apriori Algorithm
• Apriori algorithm is mining frequent item sets for single dimensional
boolean associations rules
• The name of the algorithm is based on the fact that the algorithm uses
prior knowledge of frequent items
• Applies an iterative approach known as level wise search, where k-
items are used to explore k+1 item
• Apriori property is used to reduce the search space
• Apriori Property – ―All nonempty subset of frequent items must be
also frequent‖ also called as ―Anti-Monotonic‖ property
• Anti-monotone in the sense that if a set cannot pass a test, all its
supper sets will fail the same test as well

Data representation
• Market basket data can be represented in a binary
format.
• Each row corresponds to a transaction and each
column corresponds to an item.
• An item can be treated as a binary variable whose
value is one if the item is present in a transaction
and zero otherwise.

14
Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length k
frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated

15
The Apriori Algorithm—An Example
Database TDB
1st scan
C1
L1
L2
C2 C2
2nd scan
C3 L3
3rd scan
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
Itemset sup
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3
Itemset sup
{A} 2
{B} 3
{C} 3
{E} 3
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
Itemset sup
{A, B} 1
{A, C} 2
{A, E} 1
{B, C} 2
{B, E} 3
{C, E} 2
Itemset sup
{A, C} 2
{B, C} 2
{B, E} 3
{C, E} 2
Itemset
{B, C, E}
Itemset sup
{B, C, E} 2
Supmin = 2

Generation of Strong Association Rules

17
The Apriori Algorithm (Pseudo-
Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are
contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;

18
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}

Maximal Frequent Itemset
• The number of frequent itemsets generated
by the Apriori algorithm can often be very
large.
• Hence it is beneficial to identify a small
representative set from which every
frequent item set can be derived.
• One such approach is using maximal
frequent item sets.

Maximal Frequent Itemset
• A maximal frequent itemset is a frequent
itemset for which none of its immediate
supersets are frequent.
• Eg. {A,C} is the Maximal Frequent Itemset,
where the superset {A,C,E} is in-frequent in
the example problem
• Maximal frequent itemsets provide a
compact representation of all the
frequent itemsets for a particular dataset.

Closed Frequent Itemset
• An itemset is closed in a dataset if there
exists no superset that has the same support
count.
• Eg. {B,E} is the closed frequent itemset
with support count of 3, where its superset
{B,C,E} has the support count of 2.

Improving the efficiency of Apriori

The idea behind Partitioning
Algorithm

FP Tree Mining
 Apriori: uses a generate-and-test approach – generates
candidate itemsets and tests if they are frequent
– Generation of candidate itemsets is expensive(in both space
and time)
– Support counting is expensive
• Subset checking (computationally expensive)
• Multiple Database scans (I/O)
 FP-Growth: allows frequent itemset discovery without
candidate itemset generation. Two step approach:
– Step 1: Build a compact data structure called the FP-tree
• Built using 2 passes over the data-set.
– Step 2: Extracts frequent itemsets directly from the FP-tree

29
Definition of FP-tree
 A frequent pattern tree is defined below.
 It consists of one root labeled as ―null‖, a set of item
prefix subtrees as the children of the root, and a
frequent-item header table.
 Each node in the item prefix subtree consists of three
fields: item-name, count, and node-link.
 Each entry in the frequent-item header table consists
of two fields, (1) item-name and (2) head of node-
link.

30
Frequent Pattern Tree — Example

Step 1: FP-Tree Construction
FP-Tree is constructed using 2 passes over the
data-set:
Pass 1:
– Scan data and find support for each item.
– Discard infrequent items.
– Sort frequent items in decreasing order based on
their support.
Use this order when building the FP-Tree, so
common prefixes can be shared.

Pass 2:
Nodes correspond to items and have a counter
1. FP-Growth reads 1 transaction at a time and maps it to a path
2. Fixed order is used, so paths can overlap when transactions
share items (when they have the same prfix ).
– In this case, counters are incremented
3. Pointers are maintained between nodes containing the same
item, creating singly linked lists (dotted lines)
– The more paths that overlap, the higher the compression. FP-tree
may fit in memory.
4. Frequent itemsets extracted from the FP-Tree.

(Example)

FP-Tree size
 The FP-Tree usually has a smaller size than the uncompressed
data - typically many transactions share items (and hence
prefixes).
– Best case scenario: all transactions contain the same set of items.
• 1 path in the FP-tree
– Worst case scenario: every transaction has a unique set of items
(no items in common)
• Size of the FP-tree is at least as large as the original data.
• Storage requirements for the FP-tree are higher - need to store the pointers
between the nodes and the counters.
 The size of the FP-tree depends on how the items are ordered
 Ordering by decreasing support is typically used but it does not
always lead to the smallest tree (it's a heuristic).

Step 2: Frequent Itemset Generation
FP-Growth extracts frequent itemsets from
the FP-tree.
Bottom-up algorithm - from the leaves
towards the root
Divide and conquer - Decompose both the
mining task and DB according to the
frequent patterns obtained so far

36
Compare Apriori-like method to FP-
tree
• Apriori-like method may generate an exponential
number of candidates in the worst case.
• FP-tree does not generate an exponential number
of nodes.
• The items ordered in the support-descending
order indicate that FP-tree structure is usually
highly compact.

FP Tree Pros and Cons
Advantages of FP-Growth
– only 2 passes over data-set
– ―compresses‖ data-set
– no candidate generation
– much faster than Apriori
Disadvantages of FP-Growth
– FP-Tree may not fit in memory!!
– FP-Tree is expensive to build

38
Construct FP-tree from a Transaction Database
 Let the minimum support be 20%
1. Scan DB once, find frequent 1-itemset
(single item pattern)
2. Sort frequent items in frequency
descending order, f-list
3. Scan DB again, construct FP-tree
Frequent 1-itemset Support Count
I1 6
I2 7
I3 6
I4 2
I5 2

39
TID Items bought (ordered) frequent items
T100 {I1, I2, I5} {I2, I1, I5}
T200 {I2, I4} {I2, I4}
T300 {I2, I3} {I2, I3}
T400 {I1, I2, I4} {I2, I1, I4}
T500 {I1, I3} {I1, I3}
T600 {I2, I3} {I2, I3}
T700 {I1,I3} {I1,I3}
T800 {I1, I2, I3, I5} {I2, I1, I3, I5}
T900 {I1, I2, I5} {I2, I1, I5}

40

41
Why Is FP-Growth the Winner?
• Divide-and-conquer:
– leads to focused search of smaller databases
• Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic operations—counting local frequent items and
building sub FP-tree, no pattern search and matching

Evaluation of Association Rules
• Association rule algorithms tend to produce too many rules – many
of them are ―uninteresting‖
• A rule is considered subjectively uninteresting unless it
• reveals unexpected information about the data, or
• provides useful knowledge that can lead to profitable actions.
• How can we know some rules are not good (even with high
confidence and support)?
• Leads to define various measures Used to define various measures •
Traditionally: support, confidence
New: lift, etc.

Strong rules are not necessarily Interesting
• Support and Confidence measures are insufficient in filtering
out un-interestingness association rules
• To tackle the weakness, a correlation measure can be used to
augment the support-confidence framework
• Measure of dependent/correlated events:
• Hence a typical association rule can be represented in the form
of
A  B { support,confidence,Correlation}

Contingency Table (or) Confusion Matrix

Read the scenario
• Consider a population of 100 people in which there are 50
researchers and 50 non-researchers. 80 out of the 100 people are
coffee drinkers and 20 are coffee abstainers. Suppose 35
researchers drink coffee and the remaining 15 do not. It follows
that 45 non-researchers drink coffee and remaining 5 do not.
• How to Form contingency Table ?

October 4, 2021 Data Mining: Concepts and Techniques 46
Contingency Table for the Problem
Calculate the Support and Confidence of (Researcher  coffee drinker)
Calculate the support and confidence of (Researcher  coffee abstainer)
Coffee-
Drinker
Coffee-
abstainer
Sum
(row)
Researcher 35 15 50
Non-
Researcher
45 5 50
Sum(col.) 80 20 100

47
Interestingness Measure: Correlations (Lift)
if Corr A,B = 1 THEN ‗A‘ and ‗B‘ are independent of each other
if Corr A,B > 1 THEN ‗A‘ and ‗B‘ are positively correlated
if Corr A,B < 1 THEN ‗A‘ and ‗B‘ are Negatively correlated
Corr Researcher,Coffe drinker = 0.35/ (0.50) (0.80)
= 0.35/0.40
= 0.875
I
)
(
)
(
)
(
,
B
P
A
P
B
A
P
corr B
A



October 4, 2021 Data Mining: Concepts and Techniques 48
Answer It
• Suppose we are interested in analyzing transaction of All
electronics with respect to purchase of computer games and
videos. Of the 10,000 transaction analyzed, the data show that
6000 of the customer transaction included computer games,
while 7500 included videos and 4000 included both computer
games and videos.
• Check whether the rule computer game  Video is really
interesting for a minimum support of 0.30 and minimum
confidence of 0.66.

Multiple-level Association Rules
• Items often form hierarchy
• Flexible support settings: Items at the lower level are expected
to have lower support.
• Transaction database can be encoded based on dimensions and
levels
uniform support
Milk
[support = 10%]
2% Milk
[support = 6%]
Skim Milk
[support = 4%]
Level 1
min_sup = 5%
Level 2
min_sup = 5%
Level 1
min_sup = 5%
Level 2
min_sup = 3%
reduced support

ML/MD Associations with Flexible Support Constraints
• Why flexible support constraints?
– Real life occurrence frequencies vary greatly
• Diamond, watch, pens in a shopping basket
– Uniform support may not be an interesting model
• A flexible model
– The lower-level, the more dimension combination, and the long pattern
length, usually the smaller support
– General rules should be easy to specify and understand
– Special items and special group of items may be specified individually
and have higher priority

Multi-dimensional Association
• Single-dimensional rules:
buys(X, ―milk‖)  buys(X, ―bread‖)
• Multi-dimensional rules:  2 dimensions or predicates
– Inter-dimension assoc. rules (no repeated predicates)
age(X,‖19-25‖)  occupation(X,―student‖)  buys(X,―coke‖)
– hybrid-dimension assoc. rules (repeated predicates)
age(X,‖19-25‖)  buys(X, ―popcorn‖)  buys(X, ―coke‖)
• Categorical Attributes
– finite number of possible values, no ordering among values%
• Quantitative Attributes
– numeric, implicit ordering among values

52
Techniques for Mining MD Associations
• Search for frequent k-predicate set:
– Example: {age, occupation, buys} is a 3-predicate set
– Techniques can be categorized by how age are treated
1. Using static discretization of quantitative attributes
– Quantitative attributes are statically discretized by using
predefined concept hierarchies
2. Quantitative association rules
– Quantitative attributes are dynamically discretized into
―bins‖ based on the distribution of the data
3. Distance-based association rules
– This is a dynamic discretization process that considers the
distance between data points

Data Mining: Concepts and Techniques 53
Quantitative Association Rules
age(X,‖30-34‖) 
income(X,‖24K - 48K‖)
 buys(X,‖high
resolution TV‖)
 Numeric attributes are dynamically discretized
 Such that the confidence or compactness of the rules mined
is maximized
 2-D quantitative association rules: Aquan1  Aquan2  Acat

Usage of Binning Methods
• Binning methods do not capture the semantics of interval data
• Distance-based partitioning, more meaningful discretization
considering:
– density/number of points in an interval
– ―closeness‖ of points in an interval
Price($)
Equi-width
(width $10)
Equi-depth
(depth 2)
Distance-
based
7 [0,10] [7,20] [7,7]
20 [11,20] [22,50] [20,22]
22 [21,30] [51,53] [50,53]
50 [31,40]
51 [41,50]
53 [51,60]

6 module 4

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 6 module 4

Similar to 6 module 4 (20)

Recently uploaded

Recently uploaded (20)

6 module 4