Data Mining
Association Analysis: Basic Concepts
and Algorithms
Lecture Notes for Chapter 6
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions (table below)

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Frequent Itemset

Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper}.
k-itemset: an itemset that contains k items.
Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2 for the transactions above.
Support (s): fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.
Definition: Association Rule

Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}.

Rule evaluation metrics (for the transactions above):
Support (s): fraction of transactions that contain both X and Y. For the example rule, s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4.
Confidence (c): measures how often items in Y appear in transactions that contain X. Here c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67.
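A minimal Python sketch of these two metrics, computed for {Milk, Diaper} → {Beer} on the five transactions above (function and variable names are illustrative, not from the slides):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")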
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold

Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!
Mining Association Rules

Example of rules (from the transactions above):
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
Mining Association Rules: Two-step approach

1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup.
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation: Brute-force approach

- Each itemset in the lattice is a candidate frequent itemset.
- Count the support of each candidate by scanning the database: match each transaction against every candidate.
- Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the average transaction width => expensive since M = 2^d !!!
Computational Complexity

Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules: if d = 6, R = 602 rules.
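For reference, the standard count behind the d = 6 figure above (each rule splits a chosen subset of the d items into a non-empty antecedent and a non-empty consequent):

R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1, \qquad d = 6 \Rightarrow R = 729 - 128 + 1 = 602.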
Frequent Itemset Generation Strategies

- Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms.
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions; no need to match every candidate against every transaction.
Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
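Stated formally:

\forall X, Y : \; X \subseteq Y \;\Rightarrow\; s(X) \ge s(Y).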
Illustrating Apriori Principle (minimum support count = 3)

Items (1-itemsets), counted from the transactions above:
Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
{Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3, ...

Triplets (3-itemsets):
{Bread, Milk, Diaper} 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
Apriori Algorithm

Method:
1. Let k = 1. Generate frequent itemsets of length 1.
2. Repeat until no new frequent itemsets are identified:
   - Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
   - Prune candidate itemsets containing subsets of length k that are infrequent.
   - Count the support of each candidate by scanning the DB.
   - Eliminate candidates that are infrequent, leaving only those that are frequent.
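A minimal Python sketch of this level-wise loop (function and variable names are mine, not from the slides; real implementations use hash trees and smarter counting):

from itertools import combinations

def apriori(transactions, minsup_count):
    # transactions: list of sets of items; returns {frozenset: support count}
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # candidate generation: join frequent k-itemsets whose union has k+1 items
        prev = list(frequent)
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k + 1:
                # prune candidates that have an infrequent k-subset
                if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                    candidates.add(union)
        # support counting by scanning the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

frequent_itemsets = apriori(transactions, minsup_count=3)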
Reducing Number of Comparisons

Candidate counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets.

(Figure: N transactions matched against a hash structure of buckets holding the candidates.)
Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
- A hash function
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node).
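A small illustrative Python sketch of such a hash tree, assuming the hash function from the figure that follows (item % 3, so items 1,4,7 / 2,5,8 / 3,6,9 share a branch) and a max leaf size of 3; all names are hypothetical:

MAX_LEAF = 3

class Node:
    def __init__(self):
        self.children = {}   # hash value -> Node (internal node)
        self.itemsets = []   # candidate itemsets (leaf node)

def insert(node, itemset, depth=0):
    if node.children:                              # internal node: descend on the depth-th item
        branch = itemset[depth] % 3
        insert(node.children.setdefault(branch, Node()), itemset, depth + 1)
        return
    node.itemsets.append(itemset)                  # leaf node
    if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
        stored, node.itemsets = node.itemsets, []
        for s in stored:                           # split: redistribute on the next item
            branch = s[depth] % 3
            insert(node.children.setdefault(branch, Node()), s, depth + 1)

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9), (1, 3, 6),
              (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)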
Association Rule Discovery: Hash Tree

(Figure: candidate hash tree for the 15 candidate 3-itemsets, built with the hash function that sends items 1, 4, 7 to one branch, 2, 5, 8 to a second, and 3, 6, 9 to a third; the slides highlight, in turn, hashing on 1/4/7, on 2/5/8, and on 3/6/9.)
Subset Operation

Given a transaction t, what are the possible subsets of size 3?
(Figure: recursively enumerating the size-3 subsets of the transaction t = 1 2 3 5 6.)

Subset Operation Using Hash Tree

(Figure: the transaction 1 2 3 5 6 is hashed down the candidate hash tree; only the leaves that are reached, such as the one holding {1 5 9}, {1 3 6}, {3 4 5}, are compared against it.)
Match transaction against 11 out of 15 candidates.
Factors Affecting Complexity

- Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the max length of frequent itemsets.
- Dimensionality (number of items) of the data set: more space is needed to store the support count of each item; if the number of frequent items also increases, both computation and I/O costs may increase.
- Size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
- Average transaction width: transaction width increases with denser data sets; this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width).
Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have identical support to their supersets. The number of frequent itemsets can be enormous, so a compact representation is needed.
Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.
(Figure: itemset lattice with the border separating frequent from infrequent itemsets; the maximal itemsets lie just below the border.)
Alternative Methods for Frequent Itemset Generation

- Traversal of the itemset lattice: (a) prefix-tree vs (b) suffix-tree organization (figure), and (a) breadth-first vs (b) depth-first traversal (figure).
- Representation of the database: horizontal vs vertical data layout (figure).
FP-growth Algorithm

Use a compressed representation of the database, an FP-tree. Once the FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets.
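A compact illustrative sketch of FP-tree construction in Python for the transaction database shown below (items within each transaction are kept in the fixed order A–E, matching the figure); it builds only the tree and header table, not the full miner:

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def insert_transaction(root, items, header):
    # insert one transaction, reusing shared prefixes
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            header.setdefault(item, []).append(child)   # header-table pointer
        child.count += 1
        node = child

root, header = FPNode(), {}
for t in [["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"], ["A", "D", "E"],
          ["A", "B", "C"], ["A", "B", "C", "D"], ["B", "C"], ["A", "B", "C"],
          ["A", "B", "D"], ["B", "C", "E"]]:
    insert_transaction(root, t, header)
# root.children["A"].count == 7 and root.children["B"].count == 3, as in the figure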
FP-tree construction

Transaction database:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1: the tree is null → A:1 → B:1.
After reading TID=2: a second branch null → B:1 → C:1 → D:1 is added.
FP-Tree Construction (complete tree)

(Figure: the final FP-tree for the full transaction database, with root null, a branch starting at A:7 (child B:5) and a branch starting at B:3, plus a header table for items A–E whose pointers link all nodes carrying the same item.)

Pointers are used to assist frequent itemset generation.
FP-growth

Conditional pattern base for D (prefix paths of the FP-tree ending in D):
P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}

Recursively apply FP-growth on P.
Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD.
Tree Projection

Set enumeration tree:
- Possible extension: E(A) = {B,C,D,E}
- Possible extension: E(ABC) = {D,E}

Items are listed in lexicographic order. Each node P stores the following information:
- Itemset for node P
- List of possible lexicographic extensions of P: E(P)
- Pointer to the projected database of its ancestor node
- Bitvector containing information about which transactions in the projected database contain the itemset.
Projected Database

Original database: the same 10 transactions as above. Projected database for node A: for each transaction T, keep only the items of T that are in E(A) (empty if T does not contain A):

TID  Projected items
1    {B}
2    {}
3    {C,D,E}
4    {D,E}
5    {B,C}
6    {B,C,D}
7    {}
8    {B,C}
9    {B,D}
10   {}
ECLAT

For each item, store a list of transaction ids (tids): a TID-list (vertical data layout, below).

Determine the support of any k-itemset by intersecting the TID-lists of two of its (k-1)-subsets.

Three traversal approaches: top-down, bottom-up and hybrid.
Advantage: very fast support counting. Disadvantage: intermediate TID-lists may become too large for memory.
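A minimal sketch of ECLAT-style support counting by TID-list intersection, using the vertical layout shown below (names are illustrative):

tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

def support(itemset, tidlists):
    # support count of an itemset = size of the intersection of its TID-lists
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids)

print(support(("A", "B"), tidlists))   # |{1, 5, 7, 8}| = 4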
Horizontal data layout:
TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout (TID-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC, AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB.

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
Rule Generation

How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).

But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g. for L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
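Written out with support counts σ, the reason is:

c(ABC \to D) = \frac{\sigma(ABCD)}{\sigma(ABC)} \;\ge\; \frac{\sigma(ABCD)}{\sigma(AB)} = c(AB \to CD) \;\ge\; \frac{\sigma(ABCD)}{\sigma(A)} = c(A \to BCD),

since \sigma(ABC) \le \sigma(AB) \le \sigma(A) by the anti-monotone property of support.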
Rule Generation for Apriori Algorithm

(Figure: lattice of rules for {A,B,C,D}, from ABCD ⇒ {} at the top, through BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D, then CD ⇒ AB, BD ⇒ AC, BC ⇒ AD, AD ⇒ BC, AC ⇒ BD, AB ⇒ CD, down to D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD. A low-confidence rule and all rules below it in the lattice are pruned.)
Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:
join(CD => AB, BD => AC) would produce the candidate rule D => ABC.
Prune rule D => ABC if its subset AD => BC does not have high confidence.
Effect of Support Distribution

Many real data sets have a skewed support distribution.
(Figure: support distribution of a retail data set.)

How to set the appropriate minsup threshold?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
- If minsup is set too low, it is computationally expensive and the number of itemsets is very large.
Using a single minimum support threshold may not be effective.
Multiple Minimum Support

How to apply multiple minimum supports?
MS(i): minimum support for item i, e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%.
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%

Challenge: support is no longer anti-monotone. Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%. Then {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent.
Multiple Minimum Support (Liu 1999)

Order the items according to their minimum support (in ascending order), e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5% gives the ordering Broccoli, Salmon, Coke, Milk.

Need to modify Apriori such that:
- L1: set of frequent items
- F1: set of items whose support ≥ MS(1), where MS(1) is min_i(MS(i))
- C2: candidate itemsets of size 2 are generated from F1 instead of L1.
Multiple Minimum Support (Liu 1999)

Modifications to Apriori: in traditional Apriori, a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k, and the candidate is pruned if it contains any infrequent subsets of size k.

The pruning step has to be modified: prune only if the subset contains the first item. E.g., Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support): {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent. The candidate is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli.
Pattern Evaluation

Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant (redundant, for example, when two rules have the same support and confidence).

Interestingness measures can be used to prune/rank the derived patterns. In the original formulation of association rules, support and confidence are the only measures used.
Application of Interestingness Measure (figure)

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table:

        Y     ~Y
X       f11   f10   f1+
~X      f01   f00   f0+
        f+1   f+0   |T|

(f11: support of X and Y; f10: support of X and ~Y; f01: support of ~X and Y; f00: support of ~X and ~Y.)
The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Drawback of Confidence

        Coffee  ~Coffee
Tea     15      5        20
~Tea    75      5        80
        90      10       100

For the rule Tea → Coffee: confidence = P(Coffee | Tea) = 15/20 = 0.75, which looks high; but P(Coffee) = 0.9 and P(Coffee | ~Tea) = 75/80 = 0.9375, so knowing that a person drinks tea actually lowers the probability of coffee. High confidence alone can be misleading.
Statistical Independence

Population of 1000 students: 600 students know how to swim (S), 700 students know how to bike (B), and 420 students know how to swim and bike (S,B).

P(S ∧ B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42, so here S and B are statistically independent. In general:
- P(S ∧ B) = P(S) × P(B): statistical independence
- P(S ∧ B) > P(S) × P(B): positively correlated
- P(S ∧ B) < P(S) × P(B): negatively correlated
Statistical-based Measures

Measures that take into account statistical dependence (lift/interest, PS, φ-coefficient).
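For reference, the standard definitions of these dependence-based measures are:

\text{Lift} = \frac{P(Y \mid X)}{P(Y)}, \qquad
\text{Interest} = \frac{P(X, Y)}{P(X)\,P(Y)}, \qquad
PS = P(X, Y) - P(X)\,P(Y), \qquad
\phi = \frac{P(X, Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\,P(Y)\,[1 - P(Y)]}}.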
Example: Lift/Interest

Using the Tea/Coffee contingency table above: confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9, so Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).
Drawback of Lift & Interest

Statistical independence: if P(X,Y) = P(X)P(Y) then Lift = 1.

        Y    ~Y               Y    ~Y
X       10   0     10    X    90   0     90
~X      0    90    90    ~X   0    10    10
        10   90    100        90   10    100

For the first table Lift = 0.1 / (0.1 × 0.1) = 10, while for the second Lift = 0.9 / (0.9 × 0.9) ≈ 1.11, even though X and Y co-occur in 90% of the transactions in the second case.
There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others.

What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning: how does it affect these measures?
Properties of a Good Measure

Piatetsky-Shapiro: three properties a good measure M must satisfy:
1. M(A,B) = 0 if A and B are statistically independent.
2. M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged.
3. M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged.
Comparing Different Measures

(Figures: 10 example contingency tables, and the rankings of those tables under various measures.)

Property under Row/Column Scaling

Grade-gender example: the underlying association should be independent of the relative number of male and female students in the samples. Scaling the Male column by 2x and the Female column by 10x:

        Male  Female               Male  Female
High    2     3       5      High  4     30      34
Low     1     4       5      Low   2     40      42
        3     7       10           6     70      76
Property under Inversion Operation

(Figure: item vectors over transactions 1..N, with 0s and 1s exchanged by the inversion operation.)
Example: φ-Coefficient

The φ-coefficient is analogous to the correlation coefficient for continuous variables. It takes the same value (≈ 0.5238) for the following two tables, even though strong co-presence in the first and strong co-absence in the second mean different things:

        Y    ~Y               Y    ~Y
X       60   10    70    X    20   10    30
~X      10   20    30    ~X   10   60    70
        70   30    100        30   70    100
Property under Null Addition

(Figure: effect of adding transactions that contain neither of the two items.)

Summary of measure properties

(Table of measures vs. invariance properties omitted. Footnotes: Yes* = yes if the measure is normalized; Yes** = yes if the measure is symmetrized by taking max(M(A,B), M(B,A)); No* = symmetry under row or column permutation.)
Support-based Pruning

Most of the association rule mining algorithms use the support measure to prune rules and itemsets.

Study of the effect of support pruning on the correlation of itemsets:
- Generate 10000 random contingency tables
- Compute support and pairwise correlation for each table
- Apply support-based pruning and examine the tables that are removed

Effect of Support-based Pruning (correlation histograms, below)
(Charts: histograms of pairwise correlation for all item pairs, and for item pairs restricted to support > 0.9, > 0.7, > 0.5 and to support < 0.01, < 0.03, < 0.05.)
Effect of Support-based Pruning

Support-based pruning eliminates mostly negatively correlated itemsets.
Effect of Support-based Pruning

Investigate how support-based pruning affects other measures. Steps:
- Generate 10000 contingency tables
- Rank each table according to the different measures
- Compute the pair-wise correlation between the measures

Without support pruning (all pairs): red cells indicate correlation between the pair of measures > 0.85; 40.14% of pairs have correlation > 0.85.
(Figure: scatter plot between the Correlation and Jaccard measures.)
Effect of Support-based Pruning (with support pruning applied): …% of pairs have correlation > 0.85.
(Figures: scatter plots between the Correlation and Jaccard measures for the pruned tables.)
Subjective Interestingness Measure

Objective measure: rank patterns based on statistics computed from data, e.g., the 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.).

Subjective measure: rank patterns according to the user's interpretation:
- A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin).
- A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin).
Interestingness via Unexpectedness

Need to model the expectation of users (domain knowledge), and to combine the expectation of users with evidence from the data (i.e., the extracted patterns).

(Figure legend: + = pattern expected to be frequent, − = pattern expected to be infrequent; patterns are found to be frequent or infrequent; where expectation and evidence agree we get expected patterns, where they disagree we get unexpected patterns.)
Interestingness via Unexpectedness: Web Data (Cooley et al 2001)

Domain knowledge in the form of site structure. Given an itemset F = {X1, X2, …, Xk} (Xi: Web pages):
- L: number of links connecting the pages
- lfactor = L / (…)
- cfactor = 1 (if graph is connected), 0 (disconnected graph)
Structure evidence and usage evidence are then combined: use Dempster-Shafer theory to combine domain knowledge and evidence from data.
Data Mining
Association Rules: Advanced Concepts and Algorithms
Lecture Notes for Chapter 7
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Continuous and Categorical Attributes

Example of association rule: {…} → {… = No}

How do we apply the association analysis formulation to non-asymmetric binary variables?

Session Id  Country    Session Length (sec)  Number of Web Pages viewed  Gender  Browser Type  Buy
1           USA        982                   8                           Male    IE            No
2           China      811                   10                          Female  Netscape      No
3           USA        2125                  45                          Female  Mozilla       Yes
4           Germany    596                   4                           Male    IE            Yes
5           Australia  123                   9                           Male    Mozilla       No
…           …          …                     …                           …       …             …
10
Handling Categorical Attributes

Transform a categorical attribute into asymmetric binary variables: introduce a new "item" for each distinct attribute-value pair. Example: replace the Browser Type attribute with the items Browser Type = Internet Explorer, Browser Type = Netscape, Browser Type = Mozilla.
Handling Categorical Attributes: Potential Issues

- What if the attribute has many possible values? Example: the attribute Country has more than 200 possible values, and many of the attribute values may have very low support. Potential solution: aggregate the low-support attribute values.
- What if the distribution of attribute values is highly skewed? Example: 95% of the visitors have Buy = No, so most of the items will be associated with the (Buy = No) item. Potential solution: drop the highly frequent items.
Handling Continuous Attributes

Different kinds of rules can involve continuous attributes. Different methods:
- Discretization-based
- Statistics-based
- Non-discretization based (minApriori)
Handling Continuous Attributes: Discretization

Unsupervised discretization: equal-width binning, equal-depth binning, clustering.
Supervised discretization uses the class labels; for attribute values v1..v9 the bins (bin1, bin2, bin3) are chosen so that each bin has a homogeneous class distribution:

Class      v1   v2   v3   v4   v5   v6   v7   v8   v9
Anomalous  0    0    20   10   20   0    0    0    0
Normal     150  100  0    0    0    100  100  150  100
Discretization Issues

- The size of the discretized intervals affects support & confidence: if the intervals are too small, they may not have enough support; if the intervals are too large, they may not have enough confidence. Potential solution: use all possible intervals.
- Execution time: if an attribute contains n values, there are on average O(n^2) possible ranges.
- Too many rules: many of the resulting rules (e.g., {Refund = …} rules that differ only in their intervals) convey essentially the same information.
Approach by Srikant & Agrawal

1. Preprocess the data: discretize attributes using equi-depth partitioning; use the partial completeness measure to determine the number of partitions; merge adjacent intervals as long as support is less than max-support.
2. Apply existing association rule mining algorithms.
3. Determine interesting rules in the output.
Approach by Srikant & Agrawal (continued)

Discretization will lose information; use the partial completeness measure to determine how much information is lost.

C: frequent itemsets obtained by considering all ranges of attribute values.
P: frequent itemsets obtained by considering all ranges over the partitions.
P is K-complete with respect to C if every itemset X in C has a generalization X' in P whose support is at most K times support(X) (and likewise for the subsets of X).

Given K (the partial completeness level), one can determine the number of intervals (N).
(Figure: an itemset X and its approximated version.)
Interestingness Measure

Given an itemset Z = {z1, z2, …, zk} and its generalization Z' = {z1', z2', …, zk'}:
- P(Z): support of Z
- E_Z'(Z): expected support of Z based on Z'
Z is R-interesting with respect to Z' if P(Z) ≥ R × E_Z'(Z).

Similarly, with E_S'(S) the expected support (or confidence) of rule S based on its ancestor rule S': rule S is R-interesting w.r.t. its ancestor rule S' if its support P(S) (or its confidence) is at least R times the expected value.
Statistics-based Methods

Example: a rule whose consequent consists of a continuous variable, characterized by its statistics (mean, median, standard deviation, etc.).

Approach:
- Withhold the target variable from the rest of the data.
- Apply existing frequent itemset generation on the rest of the data.
- For each frequent itemset, compute the descriptive statistics for the corresponding target variable; the frequent itemset becomes a rule by introducing the target variable as the rule consequent.
- Apply a statistical test to determine the interestingness of the rule.
Statistics-based Methods

How to determine whether an association rule is interesting? Compare the statistics for the segment of the population covered by the rule vs. the segment of the population not covered by the rule, using a statistical hypothesis test whose statistic Z has zero mean and variance 1 under the null hypothesis.
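For reference, the usual two-sample form of this statistic (with Δ the required minimum difference between the two means, as in the example that follows) is:

Z = \frac{\mu' - \mu - \Delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}.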
Statistics-based Methods: Example

The means of the rule r and of its complement r' must differ by more than a given number of years (i.e., by Δ). For r: n1 = 50, s1 = 3.5. For r' (the complement): n2 = 250, s2 = 6.5. For a one-sided test at the 95% confidence level, the critical Z-value for rejecting the null hypothesis is 1.64. Since Z is greater than 1.64, r is an interesting rule.
Min-Apriori (Han et al)

Example: W1 and W2 tend to appear together in the same documents.

Document-term matrix:
TID  W1  W2  W3  W4  W5
D1   2   2   0   0   1
D2   0   0   1   2   2
D3   2   3   0   0   0
D4   0   0   1   0   1
D5   1   1   1   0   2
Min-Apriori

The data contains only continuous attributes of the same "type", e.g., the frequency of words in a document.

Potential solutions:
- Convert into a 0/1 matrix and then apply existing algorithms: loses the word frequency information.
- Discretization does not apply, as users want associations among words, not among ranges of word frequencies.
Min-Apriori

How to determine the support of a word? If we simply sum up its frequency, the support count will be greater than the total number of documents! Instead, normalize the word vectors, e.g., using the L1 norm, so that each word has a support equal to 1.0.

Normalized document-term matrix:
TID  W1    W2    W3    W4    W5
D1   0.40  0.33  0.00  0.00  0.17
D2   0.00  0.00  0.33  1.00  0.33
D3   0.40  0.50  0.00  0.00  0.00
D4   0.00  0.00  0.33  0.00  0.17
D5   0.20  0.17  0.33  0.00  0.33
Min-Apriori: new definition of support:
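With D(i, j) denoting the normalized weight of word j in document i (the matrix above), the Min-Apriori support of an itemset C is the sum over documents of the smallest weight among C's words; for {W1, W2} this gives 0.90:

\mathrm{sup}(C) = \sum_{i \in T} \min_{j \in C} D(i, j), \qquad \mathrm{sup}(\{W_1, W_2\}) = 0.33 + 0 + 0.40 + 0 + 0.17 = 0.90.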
Multi-level Association Rules

Why should we incorporate a concept hierarchy?
- Rules at lower levels may not have enough support to appear in any frequent itemsets.
- Rules at lower levels of the hierarchy are overly specific: e.g., the many rules relating specific milk products to specific bread products are all indicative of the association between milk and bread.
Multi-level Association Rules

How do support and confidence vary as we traverse the concept hierarchy?
- If X is the parent item of both X1 and X2, then the support of X is at least as large as the support of X1 and of X2.
- If {X1, Y1} is frequent, and X is the parent of X1 and Y is the parent of Y1, then {X, Y1}, {X1, Y} and {X, Y} are also frequent.
- If conf(X1 ⇒ Y1) ≥ minconf, then conf(X1 ⇒ Y) ≥ minconf.
Multi-level Association Rules: Approach 1

Extend the current association rule formulation by augmenting each transaction with higher-level items.
Original transaction: {skim milk, wheat bread}
Augmented transaction: {skim milk, wheat bread, milk, bread, food}

Issues:
- Items that reside at higher levels have much higher support counts; if the support threshold is low, there are too many frequent patterns involving items from the higher levels.
- Increased dimensionality of the data.
Multi-level Association Rules: Approach 2

Generate frequent patterns at the highest level first; then generate frequent patterns at the next highest level, and so on.

Issues:
- I/O requirements will increase dramatically because we need to perform more passes over the data.
- May miss some potentially interesting cross-level association patterns.
Sequence Data

Sequence database:
(Figure: a table of objects with timestamps 10–35 and the events, e.g. 1, 2, 3, 5, 6, occurring at each timestamp, drawn as per-object timelines.)
(Figure: a data sequence is an ordered list of elements (transactions), each element containing one or more events (items).)

Examples of sequence data:

Sequence Database   Sequence                                      Element (Transaction)                                          Event (Item)
Customer            Purchase history of a given customer          A set of items bought by a customer at time t                  Books, dairy products, CDs, etc.
Web Data            Browsing activity of a particular Web visitor A collection of files viewed by a Web visitor after a single mouse click   Home page, index page, contact info, etc.
Event data          History of events generated by a given sensor Events triggered by a sensor at time t                         Types of alarms generated by sensors
Genome sequences    DNA sequence of a particular species          An element of the DNA sequence                                 Bases A, T, G, C
Formal Definition of a SequenceA sequence is an ordered list of
elements (transactions)
s = < e1 e2 e3 … >
Each element contains a collection of events (items)
ei = {i1, i2, …, ik}
Each element is attributed to a specific time or location
Length of a sequence, |s|, is given by the number of elements of
the sequence
A k-sequence is a sequence that contains k events (items)
Examples of Sequences

Web sequence:
< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >

Sequence of initiating events causing the nuclear accident at 3-mile Island (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm):
< {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >

Sequence of books checked out at a library:
< {Fellowship of the Ring} {The Two Towers} {Return of the King} >
Formal Definition of a Subsequence

A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, an ⊆ b_in.

The support of a subsequence w is defined as the fraction of data sequences that contain w. A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).

Data sequence                   Subsequence       Contain?
< {2,4} {3,5,6} {8} >           < {2} {3,5} >     Yes
< {1,2} {3,4} >                 < {1} {2} >       No
< {2,4} {2,4} {2,5} >           < {2} {4} >       Yes
Sequential Pattern Mining: Definition

Given a database of sequences and a user-specified minimum support threshold minsup, the task is to find all subsequences with support ≥ minsup.

Sequential Pattern Mining: Challenge

Given a sequence <{a b} {c d e} {f} {g h i}>, examples of subsequences are <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.

How many k-subsequences can be extracted from a given n-sequence? For <{a b} {c d e} {f} {g h i}>, n = 9; e.g. for k = 4, choosing the items marked Y in "Y _ _ Y Y _ _ _ Y" gives <{a} {d e} {i}>.
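The answer is a binomial coefficient, since a k-subsequence corresponds to choosing which k of the n items to keep, in order:

\binom{n}{k} \text{ k-subsequences}; \qquad n = 9,\; k = 4: \binom{9}{4} = 126.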
Sequential Pattern Mining: Example
Minsup = 50%
Examples of Frequent Subsequences:
< {1,2} > s=60%
< {2,3} > s=60%
< {2,4}> s=80%
< {3} {5}> s=80%
< {1} {2} > s=80%
< {2} {2} > s=60%
< {1} {2,3} > s=60%
< {2} {2,3} > s=60%
< {1,2} {2,3} > s=60%
Extracting Sequential Patterns

Given n events i1, i2, i3, …, in:
Candidate 1-subsequences: <{i1}>, <{i2}>, <{i3}>, …, <{in}>
Candidate 2-subsequences: <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}>
Candidate 3-subsequences: <{i1, i2, i3}>, <{i1, i2, i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …, <{i1} {i1, i2}>, <{i1} {i1, i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …
Generalized Sequential Pattern (GSP)

Step 1: Make the first pass over the sequence database D to yield all the 1-element frequent sequences.
Step 2: Repeat until no new frequent sequences are found:
- Candidate generation: merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items.
- Candidate pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences.
- Support counting: make a new pass over the sequence database D to find the support for these candidate sequences.
- Candidate elimination: eliminate candidate k-sequences whose actual support is less than minsup.
Candidate Generation

Base case (k=2): merging two frequent 1-sequences <{i1}> and <{i2}> will produce two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}>.

General case (k>2): a frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2. The resulting candidate after merging is the sequence w1 extended with the last event of w2:
- If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1.
- Otherwise, the last event in w2 becomes a separate element appended to the end of w1.
A sketch of this merge test appears below, followed by worked examples.
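A small Python sketch of this merge test and merge step, with a sequence represented as a list of tuples (one tuple per element); names are illustrative, and the snippet reproduces the worked examples that follow:

def can_merge(w1, w2):
    # GSP join test: w1 minus its first event == w2 minus its last event
    def drop_first_event(seq):
        head = seq[0][1:]
        return ([head] if head else []) + list(seq[1:])
    def drop_last_event(seq):
        tail = seq[-1][:-1]
        return list(seq[:-1]) + ([tail] if tail else [])
    return drop_first_event(w1) == drop_last_event(w2)

def merge(w1, w2):
    # extend w1 with the last event of w2
    last = w2[-1][-1]
    if len(w2[-1]) > 1:                      # last two events of w2 share an element
        return list(w1[:-1]) + [w1[-1] + (last,)]
    return list(w1) + [(last,)]

w1, w2 = [(1,), (2, 3), (4,)], [(2, 3), (4, 5)]
print(can_merge(w1, w2), merge(w1, w2))      # True  [(1,), (2, 3), (4, 5)]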
Candidate Generation Examples

Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}>, because the last two events in w2 (4 and 5) belong to the same element.

Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}>, because the last two events in w2 (4 and 5) do not belong to the same element.

We do not have to merge the sequences w1 = <{1} {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}>, because if the latter is a viable candidate, then it can be obtained by merging w1 with <{1} {2 6} {5}>.
GSP Example

Frequent 3-sequences:
< {1} {2} {3} >, < {1} {2 5} >, < {1} {5} {3} >, < {2} {3} {4} >, < {2 5} {3} >, < {3} {4} {5} >, < {5} {3 4} >

Candidate generation (4-sequences):
< {1} {2} {3} {4} >, < {1} {2 5} {3} >, < {1} {5} {3 4} >, < {2} {3} {4} {5} >, < {2 5} {3 4} >

After candidate pruning:
< {1} {2 5} {3} >
Timing Constraints (I)

(Figure: a pattern {A B} {C} {D E} annotated with the constraints: each gap between consecutive elements must be > ng and <= xg, and the whole pattern must span <= ms.)
xg: max-gap, ng: min-gap, ms: maximum span.

With xg = 2, ng = 0, ms = 4:
Data sequence                              Subsequence       Contain?
< {2,4} {3,5,6} {4,7} {4,5} {8} >          < {6} {5} >       Yes
< {1} {2} {3} {4} {5} >                    < {1} {4} >       No
< {1} {2,3} {3,4} {4,5} >                  < {2} {3} {5} >   Yes
< {1,2} {3} {2,3} {3,4} {2,4} {4,5} >      < {1,2} {5} >     No
Mining Sequential Patterns with Timing Constraints

Approach 1: mine sequential patterns without timing constraints, then postprocess the discovered patterns.
Approach 2: modify GSP to directly prune candidates that violate the timing constraints. Question: does the Apriori principle still hold?
Apriori Principle for Sequence Data
Suppose:
xg = 1 (max-gap)
ng = 0 (min-gap)
ms = 5 (maximum span)
minsup = 60%
<{2} {5}> support = 40%
but
<{2} {3} {5}> support = 60%
Problem exists because of max-gap constraint
No such problem if max-gap is infinite
Contiguous Subsequences

s is a contiguous subsequence of w = <e1> <e2> … <ek> if any of the following conditions hold:
1. s is obtained from w by deleting an item from either e1 or ek;
2. s is obtained from w by deleting an item from any element ei that contains more than 2 items;
3. s is a contiguous subsequence of s', and s' is a contiguous subsequence of w (recursive definition).

Examples: s = < {1} {2} > is a contiguous subsequence of < {1} {2 3} >, < {1 2} {2} {3} >, and < {3 4} {1 2} {2 3} {4} >, but is not a contiguous subsequence of < {1} {3} {2} > or < {2} {1} {3} {2} >.
Modified Candidate Pruning Step

Without the max-gap constraint: a candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.
With the max-gap constraint: a candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent.
Timing Constraints (II)

xg: max-gap, ng: min-gap, ws: window size, ms: maximum span.

With xg = 2, ng = 0, ws = 1, ms = 5:
Data sequence                          Subsequence       Contain?
< {2,4} {3,5,6} {4,7} {4,6} {8} >      < {3} {5} >       No
< {1} {2} {3} {4} {5} >                < {1,2} {3} >     Yes
< {1,2} {2,3} {3,4} {4,5} >            < {1,2} {3,4} >   Yes
Modified Support Counting Step

Given a candidate pattern <{a, c}>, any data sequence that contains
<… {a c} …>,
<… {a} … {c} …> (where time({c}) − time({a}) ≤ ws), or
<… {c} … {a} …> (where time({a}) − time({c}) ≤ ws)
will contribute to the support count of the candidate pattern.
Other Formulation

In some domains, we may have only one very long time series. Examples: monitoring network traffic events for attacks, or monitoring telecommunication alarm signals. The goal is to find frequent sequences of events in the time series; this problem is also known as frequent episode mining.

(Figure: a single long event stream such as E1 E2, E1 E2, E1 E2, E3 E4, …, within which frequent episodes occur.)
(Figure: general support counting schemes on an object's timeline, with the counts produced by the CWIN, CMINWIN, CDIST_O and CDIST schemes for one example.)
Frequent Subgraph Mining

Extend association rule mining to finding frequent subgraphs. Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

Graph Definitions (figure)

Representing Transactions as Graphs

Each transaction is a clique of items:
Transaction Id  Items
1               {A,B,C,D}
2               {A,B,E}
3               {B,C}
4               {A,B,D,E}
5               {B,C,D}
Representing Graphs as Transactions (figure)

Challenges
- Nodes may contain duplicate labels.
- Support and confidence: how to define them?
- Additional constraints imposed by pattern structure: support and confidence are not the only constraints. Assumption: frequent subgraphs must be connected.
- Apriori-like approach: use frequent k-subgraphs to generate frequent (k+1)-subgraphs. What is k?

Challenges (continued)
- Support: the number of graphs that contain a particular subgraph.
- The Apriori principle still holds.
- Level-wise (Apriori-like) approach: vertex growing (k is the number of vertices) or edge growing (k is the number of edges).
Vertex Growing

(Figure: two k-subgraphs G1 and G2 with edge labels p, q, r, e, f that share a common core are joined into G3 = join(G1, G2) by adding one vertex.)
Apriori-like Algorithm

1. Find frequent 1-subgraphs.
2. Repeat:
   - Candidate generation: use frequent (k-1)-subgraphs to generate candidate k-subgraphs.
   - Candidate pruning: prune candidate subgraphs that contain infrequent (k-1)-subgraphs.
   - Support counting: count the support of each remaining candidate.
   - Eliminate candidate k-subgraphs that are infrequent.

In practice, it is not as easy; there are many other issues.
Example: Dataset

(Figure: a small graph dataset with edge labels such as p, d, c, e, mined level-wise with minimum support count = 2; the frequent k = 2 subgraphs, the generated candidates, and a pruned candidate are shown.)
Candidate Generation

In Apriori, merging two frequent k-itemsets produces one candidate (k+1)-itemset. In frequent subgraph mining (vertex or edge growing), merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph.
Multiplicity of Candidates (Vertex Growing)

(Figure: joining two subgraphs by vertex growing can yield more than one candidate, because the edge between the two newly added vertices is not determined by either parent.)
Multiplicity of Candidates (Edge Growing)

Case 3: core multiplicity (figure).
Adjacency Matrix Representation

The same graph can be represented in many ways. For example, the 8-vertex graph with vertices A(1)–A(4) and B(5)–B(8) has, under two different vertex orderings, the adjacency matrix rows (listed for A(1) … B(8)):

Ordering 1: 11101000, 11010100, 10110010, 01110001, 10001110, 01001101, 00101011, 00010111
Ordering 2: 11010100, 11100010, 01111000, 10110001, 00101011, 10000111, 01001110, 00011101
Graph Isomorphism

A graph is isomorphic to another graph if the two are topologically equivalent.
(Figure: two isomorphic graphs whose vertices are labelled A and B.)
Graph Isomorphism

A test for graph isomorphism is needed:
- during the candidate generation step, to determine whether a candidate has already been generated;
- during the candidate pruning step, to check whether its (k-1)-subgraphs are frequent;
- during candidate counting, to check whether a candidate is contained within another graph.
Graph Isomorphism

Use canonical labeling to handle isomorphism: map each graph into an ordered string representation (known as its code) such that two isomorphic graphs are mapped to the same canonical encoding. Example: use the lexicographically largest adjacency matrix.

String: 0010001111010110
Canonical: 0111101011001000
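A brute-force Python sketch of the idea (it flattens the full permuted adjacency matrix rather than the triangular string used on the slide, and is only practical for very small graphs):

from itertools import permutations

def canonical_code(adj):
    # adj: square 0/1 adjacency matrix as a list of lists
    # try every vertex ordering and keep the lexicographically largest flattened string
    n = len(adj)
    best = ""
    for perm in permutations(range(n)):
        code = "".join(str(adj[i][j]) for i in perm for j in perm)
        if code > best:
            best = code
    return best

# Two isomorphic graphs (the same graph with vertices relabelled) receive the same code.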