Data Mining
Association Analysis: Basic Concepts
and Algorithms
Lecture Notes for Chapter 6
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions (table below)

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Frequent Itemset

Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper}.
k-itemset: an itemset that contains k items.
Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2 for the transactions above.
Support (s): fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.
Definition: Association Rule

Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}.

Rule evaluation metrics (for the transactions above):
Support (s): fraction of transactions that contain both X and Y. For the example rule, s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4.
Confidence (c): measures how often items in Y appear in transactions that contain X. Here c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67.
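A minimal Python sketch of these two metrics, computed for {Milk, Diaper} → {Beer} on the five transactions above (function and variable names are illustrative, not from the slides):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")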
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold

Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!
Mining Association Rules

Example of rules (from the transactions above):
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
Mining Association Rules: Two-step approach

1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup.
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation: Brute-force approach

- Each itemset in the lattice is a candidate frequent itemset.
- Count the support of each candidate by scanning the database: match each transaction against every candidate.
- Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the average transaction width => expensive since M = 2^d !!!
Computational Complexity

Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules: if d = 6, R = 602 rules.
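For reference, the standard count behind the d = 6 figure above (each rule splits a chosen subset of the d items into a non-empty antecedent and a non-empty consequent):

R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1, \qquad d = 6 \Rightarrow R = 729 - 128 + 1 = 602.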
Frequent Itemset Generation Strategies

- Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms.
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions; no need to match every candidate against every transaction.
Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
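Stated formally:

\forall X, Y : \; X \subseteq Y \;\Rightarrow\; s(X) \ge s(Y).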
Illustrating Apriori Principle (minimum support count = 3)

Items (1-itemsets), counted from the transactions above:
Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
{Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3, ...

Triplets (3-itemsets):
{Bread, Milk, Diaper} 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
Apriori Algorithm

Method:
1. Let k = 1. Generate frequent itemsets of length 1.
2. Repeat until no new frequent itemsets are identified:
   - Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
   - Prune candidate itemsets containing subsets of length k that are infrequent.
   - Count the support of each candidate by scanning the DB.
   - Eliminate candidates that are infrequent, leaving only those that are frequent.
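A minimal Python sketch of this level-wise loop (function and variable names are mine, not from the slides; real implementations use hash trees and smarter counting):

from itertools import combinations

def apriori(transactions, minsup_count):
    # transactions: list of sets of items; returns {frozenset: support count}
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # candidate generation: join frequent k-itemsets whose union has k+1 items
        prev = list(frequent)
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k + 1:
                # prune candidates that have an infrequent k-subset
                if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                    candidates.add(union)
        # support counting by scanning the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

frequent_itemsets = apriori(transactions, minsup_count=3)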
Reducing Number of Comparisons

Candidate counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets.

(Figure: N transactions matched against a hash structure of buckets holding the candidates.)
Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
- A hash function
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node).
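A small illustrative Python sketch of such a hash tree, assuming the hash function from the figure that follows (item % 3, so items 1,4,7 / 2,5,8 / 3,6,9 share a branch) and a max leaf size of 3; all names are hypothetical:

MAX_LEAF = 3

class Node:
    def __init__(self):
        self.children = {}   # hash value -> Node (internal node)
        self.itemsets = []   # candidate itemsets (leaf node)

def insert(node, itemset, depth=0):
    if node.children:                              # internal node: descend on the depth-th item
        branch = itemset[depth] % 3
        insert(node.children.setdefault(branch, Node()), itemset, depth + 1)
        return
    node.itemsets.append(itemset)                  # leaf node
    if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
        stored, node.itemsets = node.itemsets, []
        for s in stored:                           # split: redistribute on the next item
            branch = s[depth] % 3
            insert(node.children.setdefault(branch, Node()), s, depth + 1)

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (1, 3, 6),
              (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)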
Association Rule Discovery: Hash Tree

(Figure: candidate hash tree for the 15 candidate 3-itemsets, built with the hash function that sends items 1, 4, 7 to one branch, 2, 5, 8 to a second, and 3, 6, 9 to a third; the slides highlight, in turn, hashing on 1/4/7, on 2/5/8, and on 3/6/9.)
Subset Operation

Given a transaction t, what are the possible subsets of size 3?
(Figure: recursively enumerating the size-3 subsets of the transaction t = 1 2 3 5 6.)

Subset Operation Using Hash Tree

(Figure: the transaction 1 2 3 5 6 is hashed down the candidate hash tree; only the leaves that are reached, such as the one holding {1 5 9}, {1 3 6}, {3 4 5}, are compared against it.)
Match transaction against 11 out of 15 candidates.
Factors Affecting Complexity

- Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the max length of frequent itemsets.
- Dimensionality (number of items) of the data set: more space is needed to store the support count of each item; if the number of frequent items also increases, both computation and I/O costs may increase.
- Size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
- Average transaction width: transaction width increases with denser data sets; this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width).
Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have identical support to their supersets. The number of frequent itemsets can be enormous, so a compact representation is needed.
Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.
(Figure: itemset lattice with the border separating frequent from infrequent itemsets; the maximal itemsets lie just below the border.)
Alternative Methods for Frequent Itemset Generation

- Traversal of the itemset lattice: (a) prefix-tree vs (b) suffix-tree organization (figure), and (a) breadth-first vs (b) depth-first traversal (figure).
- Representation of the database: horizontal vs vertical data layout (figure).
FP-growth Algorithm

Use a compressed representation of the database, an FP-tree. Once the FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets.
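A compact illustrative sketch of FP-tree construction in Python for the transaction database shown below (items within each transaction are kept in the fixed order A–E, matching the figure); it builds only the tree and header table, not the full miner:

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def insert_transaction(root, items, header):
    # insert one transaction, reusing shared prefixes
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            header.setdefault(item, []).append(child)   # header-table pointer
        child.count += 1
        node = child

root, header = FPNode(), {}
for t in [["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"], ["A", "D", "E"],
          ["A", "B", "C"], ["A", "B", "C", "D"], ["B", "C"], ["A", "B", "C"],
          ["A", "B", "D"], ["B", "C", "E"]]:
    insert_transaction(root, t, header)
# root.children["A"].count == 7 and root.children["B"].count == 3, as in the figure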
FP-tree construction

Transaction database:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1: the tree is null → A:1 → B:1.
After reading TID=2: a second branch null → B:1 → C:1 → D:1 is added.
FP-Tree Construction (complete tree)

(Figure: the final FP-tree for the full transaction database, with root null, a branch starting at A:7 (child B:5) and a branch starting at B:3, plus a header table for items A–E whose pointers link all nodes carrying the same item.)

Pointers are used to assist frequent itemset generation.
FP-growth

Conditional pattern base for D (prefix paths of the FP-tree ending in D):
P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}

Recursively apply FP-growth on P.
Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD.
Tree Projection

Set enumeration tree:
- Possible extension: E(A) = {B,C,D,E}
- Possible extension: E(ABC) = {D,E}

Items are listed in lexicographic order. Each node P stores the following information:
- Itemset for node P
- List of possible lexicographic extensions of P: E(P)
- Pointer to the projected database of its ancestor node
- Bitvector containing information about which transactions in the projected database contain the itemset.
Projected Database

Original database: the same 10 transactions as above. Projected database for node A: for each transaction T, keep only the items of T that are in E(A) (empty if T does not contain A):

TID  Projected items
1    {B}
2    {}
3    {C,D,E}
4    {D,E}
5    {B,C}
6    {B,C,D}
7    {}
8    {B,C}
9    {B,D}
10   {}
ECLAT

For each item, store a list of transaction ids (tids): a TID-list (vertical data layout, below).

Determine the support of any k-itemset by intersecting the TID-lists of two of its (k-1)-subsets.

Three traversal approaches: top-down, bottom-up and hybrid.
Advantage: very fast support counting. Disadvantage: intermediate TID-lists may become too large for memory.
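A minimal sketch of ECLAT-style support counting by TID-list intersection, using the vertical layout shown below (names are illustrative):

tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

def support(itemset, tidlists):
    # support count of an itemset = size of the intersection of its TID-lists
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids)

print(support(("A", "B"), tidlists))   # |{1, 5, 7, 8}| = 4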
Horizontal data layout:
TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout (TID-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC, AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB.

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
Rule Generation

How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).

But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g. for L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
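Written out with support counts σ, the reason is:

c(ABC \to D) = \frac{\sigma(ABCD)}{\sigma(ABC)} \;\ge\; \frac{\sigma(ABCD)}{\sigma(AB)} = c(AB \to CD) \;\ge\; \frac{\sigma(ABCD)}{\sigma(A)} = c(A \to BCD),

since \sigma(ABC) \le \sigma(AB) \le \sigma(A) by the anti-monotone property of support.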
Rule Generation for Apriori Algorithm

(Figure: lattice of rules for {A,B,C,D}, from ABCD ⇒ {} at the top, through BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D, then CD ⇒ AB, BD ⇒ AC, BC ⇒ AD, AD ⇒ BC, AC ⇒ BD, AB ⇒ CD, down to D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD. A low-confidence rule and all rules below it in the lattice are pruned.)
Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:
join(CD => AB, BD => AC) would produce the candidate rule D => ABC.
Prune rule D => ABC if its subset AD => BC does not have high confidence.
Effect of Support Distribution

Many real data sets have a skewed support distribution.
(Figure: support distribution of a retail data set.)

How to set the appropriate minsup threshold?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
- If minsup is set too low, it is computationally expensive and the number of itemsets is very large.
Using a single minimum support threshold may not be effective.
Multiple Minimum Support

How to apply multiple minimum supports?
MS(i): minimum support for item i, e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%.
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%

Challenge: support is no longer anti-monotone. Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%. Then {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent.
Multiple Minimum Support (Liu 1999)

Order the items according to their minimum support (in ascending order), e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5% gives the ordering Broccoli, Salmon, Coke, Milk.

Need to modify Apriori such that:
- L1: set of frequent items
- F1: set of items whose support ≥ MS(1), where MS(1) is min_i(MS(i))
- C2: candidate itemsets of size 2 are generated from F1 instead of L1.
Multiple Minimum Support (Liu 1999)

Modifications to Apriori: in traditional Apriori, a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k, and the candidate is pruned if it contains any infrequent subsets of size k.

The pruning step has to be modified: prune only if the subset contains the first item. E.g., Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support): {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent. The candidate is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli.
Pattern Evaluation

Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant (redundant, for example, when two rules have the same support and confidence).

Interestingness measures can be used to prune/rank the derived patterns. In the original formulation of association rules, support and confidence are the only measures used.
Application of Interestingness Measure (figure)

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table:

        Y     ~Y
X       f11   f10   f1+
~X      f01   f00   f0+
        f+1   f+0   |T|

(f11: support of X and Y; f10: support of X and ~Y; f01: support of ~X and Y; f00: support of ~X and ~Y.)
The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Drawback of Confidence

        Coffee  ~Coffee
Tea     15      5        20
~Tea    75      5        80
        90      10       100

For the rule Tea → Coffee: confidence = P(Coffee | Tea) = 15/20 = 0.75, which looks high; but P(Coffee) = 0.9 and P(Coffee | ~Tea) = 75/80 = 0.9375, so knowing that a person drinks tea actually lowers the probability of coffee. High confidence alone can be misleading.
Statistical Independence

Population of 1000 students: 600 students know how to swim (S), 700 students know how to bike (B), and 420 students know how to swim and bike (S,B).

P(S ∧ B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42, so here S and B are statistically independent. In general:
- P(S ∧ B) = P(S) × P(B): statistical independence
- P(S ∧ B) > P(S) × P(B): positively correlated
- P(S ∧ B) < P(S) × P(B): negatively correlated
Statistical-based Measures

Measures that take into account statistical dependence (lift/interest, PS, φ-coefficient).
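For reference, the standard definitions of these dependence-based measures are:

\text{Lift} = \frac{P(Y \mid X)}{P(Y)}, \qquad
\text{Interest} = \frac{P(X, Y)}{P(X)\,P(Y)}, \qquad
PS = P(X, Y) - P(X)\,P(Y), \qquad
\phi = \frac{P(X, Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\,P(Y)\,[1 - P(Y)]}}.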
Example: Lift/Interest

Using the Tea/Coffee contingency table above: confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9, so Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).
Drawback of Lift & Interest

Statistical independence: if P(X,Y) = P(X)P(Y) then Lift = 1.

        Y    ~Y               Y    ~Y
X       10   0     10    X    90   0     90
~X      0    90    90    ~X   0    10    10
        10   90    100        90   10    100

For the first table Lift = 0.1 / (0.1 × 0.1) = 10, while for the second Lift = 0.9 / (0.9 × 0.9) ≈ 1.11, even though X and Y co-occur in 90% of the transactions in the second case.
There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others.

What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning: how does it affect these measures?
Properties of a Good Measure

Piatetsky-Shapiro: three properties a good measure M must satisfy:
1. M(A,B) = 0 if A and B are statistically independent.
2. M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged.
3. M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged.
Comparing Different Measures

(Figures: 10 example contingency tables, and the rankings of those tables under various measures.)

Property under Row/Column Scaling

Grade-gender example: the underlying association should be independent of the relative number of male and female students in the samples. Scaling the Male column by 2x and the Female column by 10x:

        Male  Female               Male  Female
High    2     3       5      High  4     30      34
Low     1     4       5      Low   2     40      42
        3     7       10           6     70      76
Property under Inversion Operation

(Figure: item vectors over transactions 1..N, with 0s and 1s exchanged by the inversion operation.)
Example: φ-Coefficient

The φ-coefficient is analogous to the correlation coefficient for continuous variables. It takes the same value (≈ 0.5238) for the following two tables, even though strong co-presence in the first and strong co-absence in the second mean different things:

        Y    ~Y               Y    ~Y
X       60   10    70    X    20   10    30
~X      10   20    30    ~X   10   60    70
        70   30    100        30   70    100
Property under Null Addition

(Figure: effect of adding transactions that contain neither of the two items.)

Summary of measure properties

(Table of measures vs. invariance properties omitted. Footnotes: Yes* = yes if the measure is normalized; Yes** = yes if the measure is symmetrized by taking max(M(A,B), M(B,A)); No* = symmetry under row or column permutation.)
Support-based Pruning

Most of the association rule mining algorithms use the support measure to prune rules and itemsets.

Study of the effect of support pruning on the correlation of itemsets:
- Generate 10000 random contingency tables
- Compute support and pairwise correlation for each table
- Apply support-based pruning and examine the tables that are removed

Effect of Support-based Pruning (correlation histograms, below)
(Charts: histograms of pairwise correlation for all item pairs, and for item pairs restricted to support > 0.9, > 0.7, > 0.5 and to support < 0.01, < 0.03, < 0.05.)
Effect of Support-based Pruning

Support-based pruning eliminates mostly negatively correlated itemsets.
Effect of Support-based Pruning

Investigate how support-based pruning affects other measures. Steps:
- Generate 10000 contingency tables
- Rank each table according to the different measures
- Compute the pair-wise correlation between the measures

Without support pruning (all pairs): red cells indicate correlation between the pair of measures > 0.85; 40.14% of pairs have correlation > 0.85.
(Figure: scatter plot between the Correlation and Jaccard measures.)
Effect of Support-based Pruning (with support pruning applied): …% of pairs have correlation > 0.85.
(Figures: scatter plots between the Correlation and Jaccard measures for the pruned tables.)
Subjective Interestingness Measure

Objective measure: rank patterns based on statistics computed from data, e.g., the 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.).

Subjective measure: rank patterns according to the user's interpretation:
- A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin).
- A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin).
Interestingness via Unexpectedness

Need to model the expectation of users (domain knowledge), and to combine the expectation of users with evidence from the data (i.e., the extracted patterns).

(Figure legend: + = pattern expected to be frequent, − = pattern expected to be infrequent; patterns are found to be frequent or infrequent; where expectation and evidence agree we get expected patterns, where they disagree we get unexpected patterns.)
Interestingness via Unexpectedness: Web Data (Cooley et al 2001)

Domain knowledge in the form of site structure. Given an itemset F = {X1, X2, …, Xk} (Xi: Web pages):
- L: number of links connecting the pages
- lfactor = L / (…)
- cfactor = 1 (if graph is connected), 0 (disconnected graph)
Structure evidence and usage evidence are then combined: use Dempster-Shafer theory to combine domain knowledge and evidence from data.
Data Mining
Association Rules: Advanced Concepts and Algorithms
Lecture Notes for Chapter 7
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Continuous and Categorical Attributes

Example of association rule: {…} → {… = No}

How do we apply the association analysis formulation to non-asymmetric binary variables?

Session Id  Country    Session Length (sec)  Number of Web Pages viewed  Gender  Browser Type  Buy
1           USA        982                   8                           Male    IE            No
2           China      811                   10                          Female  Netscape      No
3           USA        2125                  45                          Female  Mozilla       Yes
4           Germany    596                   4                           Male    IE            Yes
5           Australia  123                   9                           Male    Mozilla       No
…           …          …                     …                           …       …             …
10
Handling Categorical Attributes

Transform a categorical attribute into asymmetric binary variables: introduce a new "item" for each distinct attribute-value pair. Example: replace the Browser Type attribute with the items Browser Type = Internet Explorer, Browser Type = Netscape, Browser Type = Mozilla.
Handling Categorical Attributes: Potential Issues

- What if the attribute has many possible values? Example: the attribute Country has more than 200 possible values, and many of the attribute values may have very low support. Potential solution: aggregate the low-support attribute values.
- What if the distribution of attribute values is highly skewed? Example: 95% of the visitors have Buy = No, so most of the items will be associated with the (Buy = No) item. Potential solution: drop the highly frequent items.
Handling Continuous Attributes

Different kinds of rules can involve continuous attributes. Different methods:
- Discretization-based
- Statistics-based
- Non-discretization based (minApriori)
Handling Continuous Attributes: Discretization

Unsupervised discretization: equal-width binning, equal-depth binning, clustering.
Supervised discretization uses the class labels; for attribute values v1..v9 the bins (bin1, bin2, bin3) are chosen so that each bin has a homogeneous class distribution:

Class      v1   v2   v3   v4   v5   v6   v7   v8   v9
Anomalous  0    0    20   10   20   0    0    0    0
Normal     150  100  0    0    0    100  100  150  100
Discretization Issues

- The size of the discretized intervals affects support & confidence: if the intervals are too small, they may not have enough support; if the intervals are too large, they may not have enough confidence. Potential solution: use all possible intervals.
- Execution time: if an attribute contains n values, there are on average O(n^2) possible ranges.
- Too many rules: many of the resulting rules (e.g., {Refund = …} rules that differ only in their intervals) convey essentially the same information.
Approach by Srikant & Agrawal

1. Preprocess the data: discretize attributes using equi-depth partitioning; use the partial completeness measure to determine the number of partitions; merge adjacent intervals as long as support is less than max-support.
2. Apply existing association rule mining algorithms.
3. Determine interesting rules in the output.
Approach by Srikant & Agrawal (continued)

Discretization will lose information; use the partial completeness measure to determine how much information is lost.

C: frequent itemsets obtained by considering all ranges of attribute values.
P: frequent itemsets obtained by considering all ranges over the partitions.
P is K-complete with respect to C if every itemset X in C has a generalization X' in P whose support is at most K times support(X) (and likewise for the subsets of X).

Given K (the partial completeness level), one can determine the number of intervals (N).
(Figure: an itemset X and its approximated version.)
Interestingness Measure

Given an itemset Z = {z1, z2, …, zk} and its generalization Z' = {z1', z2', …, zk'}:
- P(Z): support of Z
- E_Z'(Z): expected support of Z based on Z'
Z is R-interesting with respect to Z' if P(Z) ≥ R × E_Z'(Z).

Similarly, with E_S'(S) the expected support (or confidence) of rule S based on its ancestor rule S': rule S is R-interesting w.r.t. its ancestor rule S' if its support P(S) (or its confidence) is at least R times the expected value.
Statistics-based Methods

Example: a rule whose consequent consists of a continuous variable, characterized by its statistics (mean, median, standard deviation, etc.).

Approach:
- Withhold the target variable from the rest of the data.
- Apply existing frequent itemset generation on the rest of the data.
- For each frequent itemset, compute the descriptive statistics for the corresponding target variable; the frequent itemset becomes a rule by introducing the target variable as the rule consequent.
- Apply a statistical test to determine the interestingness of the rule.
Statistics-based Methods

How to determine whether an association rule is interesting? Compare the statistics for the segment of the population covered by the rule vs. the segment of the population not covered by the rule, using a statistical hypothesis test whose statistic Z has zero mean and variance 1 under the null hypothesis.
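For reference, the usual two-sample form of this statistic (with Δ the required minimum difference between the two means, as in the example that follows) is:

Z = \frac{\mu' - \mu - \Delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}.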
Statistics-based Methods: Example

The means of the rule r and of its complement r' must differ by more than a given number of years (i.e., by Δ). For r: n1 = 50, s1 = 3.5. For r' (the complement): n2 = 250, s2 = 6.5. For a one-sided test at the 95% confidence level, the critical Z-value for rejecting the null hypothesis is 1.64. Since Z is greater than 1.64, r is an interesting rule.
Min-Apriori (Han et al)

Example: W1 and W2 tend to appear together in the same documents.

Document-term matrix:
TID  W1  W2  W3  W4  W5
D1   2   2   0   0   1
D2   0   0   1   2   2
D3   2   3   0   0   0
D4   0   0   1   0   1
D5   1   1   1   0   2
Min-Apriori

The data contains only continuous attributes of the same "type", e.g., the frequency of words in a document.

Potential solutions:
- Convert into a 0/1 matrix and then apply existing algorithms: loses the word frequency information.
- Discretization does not apply, as users want associations among words, not among ranges of word frequencies.
Min-Apriori

How to determine the support of a word? If we simply sum up its frequency, the support count will be greater than the total number of documents! Instead, normalize the word vectors, e.g., using the L1 norm, so that each word has a support equal to 1.0.

Normalized document-term matrix:
TID  W1    W2    W3    W4    W5
D1   0.40  0.33  0.00  0.00  0.17
D2   0.00  0.00  0.33  1.00  0.33
D3   0.40  0.50  0.00  0.00  0.00
D4   0.00  0.00  0.33  0.00  0.17
D5   0.20  0.17  0.33  0.00  0.33
Min-Apriori: new definition of support:
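With D(i, j) denoting the normalized weight of word j in document i (the matrix above), the Min-Apriori support of an itemset C is the sum over documents of the smallest weight among C's words; for {W1, W2} this gives 0.90:

\mathrm{sup}(C) = \sum_{i \in T} \min_{j \in C} D(i, j), \qquad \mathrm{sup}(\{W_1, W_2\}) = 0.33 + 0 + 0.40 + 0 + 0.17 = 0.90.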
Multi-level Association Rules

Why should we incorporate a concept hierarchy?
- Rules at lower levels may not have enough support to appear in any frequent itemsets.
- Rules at lower levels of the hierarchy are overly specific: e.g., the many rules relating specific milk products to specific bread products are all indicative of the association between milk and bread.
Multi-level Association Rules

How do support and confidence vary as we traverse the concept hierarchy?
- If X is the parent item of both X1 and X2, then the support of X is at least as large as the support of X1 and of X2.
- If {X1, Y1} is frequent, and X is the parent of X1 and Y is the parent of Y1, then {X, Y1}, {X1, Y} and {X, Y} are also frequent.
- If conf(X1 ⇒ Y1) ≥ minconf, then conf(X1 ⇒ Y) ≥ minconf.
Multi-level Association Rules: Approach 1

Extend the current association rule formulation by augmenting each transaction with higher-level items.
Original transaction: {skim milk, wheat bread}
Augmented transaction: {skim milk, wheat bread, milk, bread, food}

Issues:
- Items that reside at higher levels have much higher support counts; if the support threshold is low, there are too many frequent patterns involving items from the higher levels.
- Increased dimensionality of the data.
Multi-level Association Rules: Approach 2

Generate frequent patterns at the highest level first; then generate frequent patterns at the next highest level, and so on.

Issues:
- I/O requirements will increase dramatically because we need to perform more passes over the data.
- May miss some potentially interesting cross-level association patterns.
Sequence Data

Sequence database:
(Figure: a table of objects with timestamps 10–35 and the events, e.g. 1, 2, 3, 5, 6, occurring at each timestamp, drawn as per-object timelines.)
(Figure: a data sequence is an ordered list of elements (transactions), each element containing one or more events (items).)

Examples of sequence data:

Sequence Database   Sequence                                      Element (Transaction)                                          Event (Item)
Customer            Purchase history of a given customer          A set of items bought by a customer at time t                  Books, dairy products, CDs, etc.
Web Data            Browsing activity of a particular Web visitor A collection of files viewed by a Web visitor after a single mouse click   Home page, index page, contact info, etc.
Event data          History of events generated by a given sensor Events triggered by a sensor at time t                         Types of alarms generated by sensors
Genome sequences    DNA sequence of a particular species          An element of the DNA sequence                                 Bases A, T, G, C
Formal Definition of a SequenceA sequence is an ordered list of
elements (transactions)
s = < e1 e2 e3 … >
Each element contains a collection of events (items)
ei = {i1, i2, …, ik}
Each element is attributed to a specific time or location
Length of a sequence, |s|, is given by the number of elements of
the sequence
A k-sequence is a sequence that contains k events (items)
Examples of Sequences

Web sequence:
< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >

Sequence of initiating events causing the nuclear accident at 3-mile Island (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm):
< {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >

Sequence of books checked out at a library:
< {Fellowship of the Ring} {The Two Towers} {Return of the King} >
Formal Definition of a Subsequence

A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, an ⊆ b_in.

The support of a subsequence w is defined as the fraction of data sequences that contain w. A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).

Data sequence                   Subsequence       Contain?
< {2,4} {3,5,6} {8} >           < {2} {3,5} >     Yes
< {1,2} {3,4} >                 < {1} {2} >       No
< {2,4} {2,4} {2,5} >           < {2} {4} >       Yes
Sequential Pattern Mining: Definition

Given a database of sequences and a user-specified minimum support threshold minsup, the task is to find all subsequences with support ≥ minsup.

Sequential Pattern Mining: Challenge

Given a sequence <{a b} {c d e} {f} {g h i}>, examples of subsequences are <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.

How many k-subsequences can be extracted from a given n-sequence? For <{a b} {c d e} {f} {g h i}>, n = 9; e.g. for k = 4, choosing the items marked Y in "Y _ _ Y Y _ _ _ Y" gives <{a} {d e} {i}>.
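The answer is a binomial coefficient, since a k-subsequence corresponds to choosing which k of the n items to keep, in order:

\binom{n}{k} \text{ k-subsequences}; \qquad n = 9,\; k = 4: \binom{9}{4} = 126.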
Sequential Pattern Mining: Example
Minsup = 50%
Examples of Frequent Subsequences:
< {1,2} > s=60%
< {2,3} > s=60%
< {2,4}> s=80%
< {3} {5}> s=80%
< {1} {2} > s=80%
< {2} {2} > s=60%
< {1} {2,3} > s=60%
< {2} {2,3} > s=60%
< {1,2} {2,3} > s=60%
Extracting Sequential Patterns

Given n events i1, i2, i3, …, in:
Candidate 1-subsequences: <{i1}>, <{i2}>, <{i3}>, …, <{in}>
Candidate 2-subsequences: <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}>
Candidate 3-subsequences: <{i1, i2, i3}>, <{i1, i2, i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …, <{i1} {i1, i2}>, <{i1} {i1, i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …
Generalized Sequential Pattern (GSP)

Step 1: Make the first pass over the sequence database D to yield all the 1-element frequent sequences.
Step 2: Repeat until no new frequent sequences are found:
- Candidate generation: merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items.
- Candidate pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences.
- Support counting: make a new pass over the sequence database D to find the support for these candidate sequences.
- Candidate elimination: eliminate candidate k-sequences whose actual support is less than minsup.
Candidate Generation

Base case (k=2): merging two frequent 1-sequences <{i1}> and <{i2}> will produce two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}>.

General case (k>2): a frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2. The resulting candidate after merging is the sequence w1 extended with the last event of w2:
- If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1.
- Otherwise, the last event in w2 becomes a separate element appended to the end of w1.
A sketch of this merge test appears below, followed by worked examples.
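A small Python sketch of this merge test and merge step, with a sequence represented as a list of tuples (one tuple per element); names are illustrative, and the snippet reproduces the worked examples that follow:

def can_merge(w1, w2):
    # GSP join test: w1 minus its first event == w2 minus its last event
    def drop_first_event(seq):
        head = seq[0][1:]
        return ([head] if head else []) + list(seq[1:])
    def drop_last_event(seq):
        tail = seq[-1][:-1]
        return list(seq[:-1]) + ([tail] if tail else [])
    return drop_first_event(w1) == drop_last_event(w2)

def merge(w1, w2):
    # extend w1 with the last event of w2
    last = w2[-1][-1]
    if len(w2[-1]) > 1:                      # last two events of w2 share an element
        return list(w1[:-1]) + [w1[-1] + (last,)]
    return list(w1) + [(last,)]

w1, w2 = [(1,), (2, 3), (4,)], [(2, 3), (4, 5)]
print(can_merge(w1, w2), merge(w1, w2))      # True  [(1,), (2, 3), (4, 5)]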
Candidate Generation Examples

Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}>, because the last two events in w2 (4 and 5) belong to the same element.

Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}>, because the last two events in w2 (4 and 5) do not belong to the same element.

We do not have to merge the sequences w1 = <{1} {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}>, because if the latter is a viable candidate, then it can be obtained by merging w1 with <{1} {2 6} {5}>.
GSP Example

Frequent 3-sequences:
< {1} {2} {3} >, < {1} {2 5} >, < {1} {5} {3} >, < {2} {3} {4} >, < {2 5} {3} >, < {3} {4} {5} >, < {5} {3 4} >

Candidate generation (4-sequences):
< {1} {2} {3} {4} >, < {1} {2 5} {3} >, < {1} {5} {3 4} >, < {2} {3} {4} {5} >, < {2 5} {3 4} >

After candidate pruning:
< {1} {2 5} {3} >
Timing Constraints (I)

(Figure: a pattern {A B} {C} {D E} annotated with the constraints: each gap between consecutive elements must be > ng and <= xg, and the whole pattern must span <= ms.)
xg: max-gap, ng: min-gap, ms: maximum span.

With xg = 2, ng = 0, ms = 4:
Data sequence                              Subsequence       Contain?
< {2,4} {3,5,6} {4,7} {4,5} {8} >          < {6} {5} >       Yes
< {1} {2} {3} {4} {5} >                    < {1} {4} >       No
< {1} {2,3} {3,4} {4,5} >                  < {2} {3} {5} >   Yes
< {1,2} {3} {2,3} {3,4} {2,4} {4,5} >      < {1,2} {5} >     No
Mining Sequential Patterns with Timing Constraints

Approach 1: mine sequential patterns without timing constraints, then postprocess the discovered patterns.
Approach 2: modify GSP to directly prune candidates that violate the timing constraints. Question: does the Apriori principle still hold?
Apriori Principle for Sequence Data
Suppose:
xg = 1 (max-gap)
ng = 0 (min-gap)
ms = 5 (maximum span)
minsup = 60%
<{2} {5}> support = 40%
but
<{2} {3} {5}> support = 60%
Problem exists because of max-gap constraint
No such problem if max-gap is infinite
Contiguous Subsequences

s is a contiguous subsequence of w = <e1> <e2> … <ek> if any of the following conditions hold:
1. s is obtained from w by deleting an item from either e1 or ek;
2. s is obtained from w by deleting an item from any element ei that contains more than 2 items;
3. s is a contiguous subsequence of s', and s' is a contiguous subsequence of w (recursive definition).

Examples: s = < {1} {2} > is a contiguous subsequence of < {1} {2 3} >, < {1 2} {2} {3} >, and < {3 4} {1 2} {2 3} {4} >, but is not a contiguous subsequence of < {1} {3} {2} > or < {2} {1} {3} {2} >.
Modified Candidate Pruning Step

Without the max-gap constraint: a candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.
With the max-gap constraint: a candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent.
Timing Constraints (II)

xg: max-gap, ng: min-gap, ws: window size, ms: maximum span.

With xg = 2, ng = 0, ws = 1, ms = 5:
Data sequence                          Subsequence       Contain?
< {2,4} {3,5,6} {4,7} {4,6} {8} >      < {3} {5} >       No
< {1} {2} {3} {4} {5} >                < {1,2} {3} >     Yes
< {1,2} {2,3} {3,4} {4,5} >            < {1,2} {3,4} >   Yes
Modified Support Counting Step

Given a candidate pattern <{a, c}>, any data sequence that contains
<… {a c} …>,
<… {a} … {c} …> (where time({c}) − time({a}) ≤ ws), or
<… {c} … {a} …> (where time({a}) − time({c}) ≤ ws)
will contribute to the support count of the candidate pattern.
Other Formulation

In some domains, we may have only one very long time series. Examples: monitoring network traffic events for attacks, or monitoring telecommunication alarm signals. The goal is to find frequent sequences of events in the time series; this problem is also known as frequent episode mining.

(Figure: a single long event stream such as E1 E2, E1 E2, E1 E2, E3 E4, …, within which frequent episodes occur.)
(Figure: general support counting schemes on an object's timeline, with the counts produced by the CWIN, CMINWIN, CDIST_O and CDIST schemes for one example.)
Frequent Subgraph Mining

Extend association rule mining to finding frequent subgraphs. Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

Graph Definitions (figure)

Representing Transactions as Graphs

Each transaction is a clique of items:
Transaction Id  Items
1               {A,B,C,D}
2               {A,B,E}
3               {B,C}
4               {A,B,D,E}
5               {B,C,D}
Representing Graphs as Transactions (figure)

Challenges
- Nodes may contain duplicate labels.
- Support and confidence: how to define them?
- Additional constraints imposed by pattern structure: support and confidence are not the only constraints. Assumption: frequent subgraphs must be connected.
- Apriori-like approach: use frequent k-subgraphs to generate frequent (k+1)-subgraphs. What is k?

Challenges (continued)
- Support: the number of graphs that contain a particular subgraph.
- The Apriori principle still holds.
- Level-wise (Apriori-like) approach: vertex growing (k is the number of vertices) or edge growing (k is the number of edges).
Vertex Growing

(Figure: two k-subgraphs G1 and G2 with edge labels p, q, r, e, f that share a common core are joined into G3 = join(G1, G2) by adding one vertex.)
Apriori-like Algorithm

1. Find frequent 1-subgraphs.
2. Repeat:
   - Candidate generation: use frequent (k-1)-subgraphs to generate candidate k-subgraphs.
   - Candidate pruning: prune candidate subgraphs that contain infrequent (k-1)-subgraphs.
   - Support counting: count the support of each remaining candidate.
   - Eliminate candidate k-subgraphs that are infrequent.

In practice, it is not as easy; there are many other issues.
Example: Dataset

(Figure: a small graph dataset with edge labels such as p, d, c, e, mined level-wise with minimum support count = 2; the frequent k = 2 subgraphs, the generated candidates, and a pruned candidate are shown.)
Candidate Generation

In Apriori, merging two frequent k-itemsets produces one candidate (k+1)-itemset. In frequent subgraph mining (vertex or edge growing), merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph.
Multiplicity of Candidates (Vertex Growing)

(Figure: joining two subgraphs by vertex growing can yield more than one candidate, because the edge between the two newly added vertices is not determined by either parent.)
Multiplicity of Candidates (Edge Growing)

Case 3: core multiplicity (figure).
Adjacency Matrix Representation

The same graph can be represented in many ways. For example, the 8-vertex graph with vertices A(1)–A(4) and B(5)–B(8) has, under two different vertex orderings, the adjacency matrix rows (listed for A(1) … B(8)):

Ordering 1: 11101000, 11010100, 10110010, 01110001, 10001110, 01001101, 00101011, 00010111
Ordering 2: 11010100, 11100010, 01111000, 10110001, 00101011, 10000111, 01001110, 00011101
Graph Isomorphism

A graph is isomorphic to another graph if the two are topologically equivalent.
(Figure: two isomorphic graphs whose vertices are labelled A and B.)
Graph Isomorphism

A test for graph isomorphism is needed:
- during the candidate generation step, to determine whether a candidate has already been generated;
- during the candidate pruning step, to check whether its (k-1)-subgraphs are frequent;
- during candidate counting, to check whether a candidate is contained within another graph.
Graph Isomorphism

Use canonical labeling to handle isomorphism: map each graph into an ordered string representation (known as its code) such that two isomorphic graphs are mapped to the same canonical encoding. Example: use the lexicographically largest adjacency matrix.

String: 0010001111010110
Canonical: 0111101011001000
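A brute-force Python sketch of the idea (it flattens the full permuted adjacency matrix rather than the triangular string used on the slide, and is only practical for very small graphs):

from itertools import permutations

def canonical_code(adj):
    # adj: square 0/1 adjacency matrix as a list of lists
    # try every vertex ordering and keep the lexicographically largest flattened string
    n = len(adj)
    best = ""
    for perm in permutations(range(n)):
        code = "".join(str(adj[i][j]) for i in perm for j in perm)
        if code > best:
            best = code
    return best

# Two isomorphic graphs (the same graph with vertices relabelled) receive the same code.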