Data Mining
Association Rules Mining or
Market Basket Analysis
Prithwis Mukerjee, Ph.D.
Let us describe the problem ...
A retailer sells the following items:
 Bread, Cheese, Coffee, Juice, Milk, Tea, Biscuits, Sugar, Newspaper
 And we assume that the shopkeeper keeps track of what
each customer purchases :
 He needs to know which items are generally sold together

Trans ID  Items
10        Bread, Cheese, Newspaper
20        Bread, Cheese, Juice
30        Bread, Milk
40        Cheese, Juice, Milk, Coffee
50        Sugar, Tea, Coffee, Biscuits, Newspaper
60        Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70        Bread, Cheese
80        Bread, Cheese, Juice, Coffee
90        Bread, Milk
100       Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper
Associations
Rules expressing relations between items in a
“Market Basket”
{ Sugar and Tea } => {Biscuits}
 Is it true that if a customer buys Sugar and Tea, she will
also buy Biscuits ?
 If so, then
 These items should be ordered together
 But discounts should not be given on these items at the same
time !
We can make a guess but
 It would be better if we could structure this problem in
terms of mathematics
Basic Concepts
Set of n Items on Sale
 I = { i1, i2, i3, ......, in }
Transaction
 A subset of I : T ⊆ I
 A set of items purchased in an individual transaction
 With each transaction having m items
 ti = { i1, i2, i3, ......, im } with m < n
 If we have N transactions then t1, t2, t3, .. tN serve as
unique identifiers for the transactions
D is our total data about all N transactions
 D = { t1, t2, t3, .. tN }
An Association Rule
Whenever X appears, Y also appears
 X ⇒ Y
 X ⊆ I, Y ⊆ I, X ∩ Y = ∅
X and Y may be
 Single items or
 Sets of items – in which the same item does not appear
X is referred to as the antecedent
Y is referred to as the consequent
Whether a rule like this exists is the focus of
our analysis
Two key concepts
Support ( or prevalence)
 How often do X and Y appear together in the basket ?
 If this number is very low then it is not worth examining
 Expressed as a fraction of the total number of transactions
 Say 10% or 0.1
Confidence ( or predictability )
 Of all the occurrences of X, in what fraction does Y also
appear ?
 Expressed as a fraction of all transactions containing X
 Say 80% or 0.8
We are interested in rules that have a
 Minimum value of support : say 25%
 Minimum value of confidence : say 75%
Mathematically speaking ...
Support (X)
 = (Number of times X appears ) / N
 = P(X)
Support (XY)
 = (Number of times X and Y appear together ) / N
 = P(X ∩ Y)
Confidence (X ⇒ Y)
 = Support (XY) / Support(X)
 = P(X ∩ Y) / P(X)
 = Conditional Probability P( Y | X)
Lift : an optional term
 Measures the power of association
 P( Y | X) / P(Y)
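As a sanity check, these three measures can be computed directly on the retailer's ten transactions from the opening example. This is a minimal sketch; the helper names `support`, `confidence` and `lift` are my own, not from the slides:

```python
# Transactions transcribed from the retailer's table (Trans IDs 10..100)
transactions = [
    {"Bread", "Cheese", "Newspaper"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk", "Coffee"},
    {"Sugar", "Tea", "Coffee", "Biscuits", "Newspaper"},
    {"Sugar", "Tea", "Coffee", "Biscuits", "Milk", "Juice", "Newspaper"},
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice", "Coffee"},
    {"Bread", "Milk"},
    {"Sugar", "Tea", "Coffee", "Bread", "Milk", "Juice", "Newspaper"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """P(Y | X) = support(X ∪ Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Confidence divided by the baseline probability of Y."""
    return confidence(X, Y, transactions) / support(Y, transactions)

X, Y = {"Sugar", "Tea"}, {"Biscuits"}
print(support(X | Y, transactions))    # 0.2 : the rule appears in 2 of 10 baskets
print(confidence(X, Y, transactions))  # 2/3 : Biscuits follow Sugar+Tea in 2 of 3 cases
```

So { Sugar, Tea } ⇒ { Biscuits } has 20% support and 67% confidence on this data.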
The task at hand ...
Given a large set of transactions, we seek a
procedure ( or algorithm )
 That will discover all association rules
 That have a minimum support of p%
 And a minimum confidence level of q%
 And to do so in an efficient manner
Algorithms
 The Naive or Brute Force Method
 The Improved Naive algorithm
 The Apriori Algorithm
 Improvements to the Apriori algorithm
 FP ( Frequent Pattern ) Algorithm
Let us try the Naive Algorithm manually !
This is the set of transactions that we have ...
 We want to find Association Rules with
 Minimum 50% support and
 Minimum 75% confidence

Trans ID  Items
100       Bread, Cheese
200       Bread, Cheese, Juice
300       Bread, Milk
400       Cheese, Juice, Milk
Itemsets & Frequencies
Which sets are frequent ?
 Since we are looking for a
support of 50%, we need a
set to appear in 2 out of 4
transactions
 = (# of times X appears ) / N
 = P(X)
 6 sets meet this criterion
Item Sets Frequency
{Bread} 3
{Cheese } 3
{Juice} 2
{Milk} 2
{Bread, Cheese} 2
{Bread, Juice } 1
{Bread, Milk} 1
{Cheese, Juice} 2
{Cheese, Milk} 1
{Juice, Milk} 1
{Bread, Cheese, Juice} 1
{Bread, Cheese, Milk} 0
{Bread, Juice, Milk} 0
{Cheese, Juice, Milk} 1
{Bread, Cheese, Juice, Milk} 0
A closer look at the “Frequent Set”
Look at itemsets with more than 1 item
 {Bread, Cheese}, {Cheese, Juice}
 4 rules are possible
Look for confidence levels
 Confidence (X ⇒ Y)
 = Support (XY) / Support(X)
Item Sets        Frequency
{Bread}          3
{Cheese}         3
{Juice}          2
{Milk}           2
{Bread, Cheese}  2
{Cheese, Juice}  2

Rule             Confidence
Bread => Cheese  2 / 3   67.00%
Cheese => Bread  2 / 3   67.00%
Cheese => Juice  2 / 3   67.00%
Juice => Cheese  2 / 2  100.00%
The Big Picture
List all itemsets
 Find frequency of each
Identify “frequent sets”
 Based on support
Search for Rules within “frequent sets”
 Based on confidence
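The big picture above can be sketched as a brute-force (naive) program on the four-transaction example; all names here are illustrative:

```python
from itertools import chain, combinations

# The four transactions of the worked example
transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
items = sorted(set().union(*transactions))
N = len(transactions)

# Step 1: list ALL 2^n - 1 non-empty itemsets and count each one
itemsets = chain.from_iterable(combinations(items, k) for k in range(1, len(items) + 1))
freq = {s: sum(1 for t in transactions if set(s) <= t) for s in itemsets}

# Step 2: keep the "frequent sets" (support >= 50%, i.e. 2 of 4 transactions)
frequent = {s: f for s, f in freq.items() if f / N >= 0.5}

# Step 3: search for rules inside the frequent sets (confidence >= 75%)
rules = []
for s, f in frequent.items():
    for k in range(1, len(s)):
        for ante in combinations(s, k):
            if f / freq[ante] >= 0.75:
                rules.append((ante, tuple(i for i in s if i not in ante)))

print(len(frequent))  # 6 frequent sets, as on the slide
print(rules)          # only Juice => Cheese reaches 75% confidence
```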
Looking Beyond the Retail Store
Counter Terrorism
 Track phone calls made
or received from a
particular number every
day
 Is an incoming call from a
particular number
followed by a call to
another number ?
 Are there any sets of
numbers that are always
called together ?
Expand the item sets
to include
 Electronic fund transfers
 Travel between two
locations
 Boarding cards
 Railway reservation
All data is available
in electronic format
Major Problem
Exponential Growth of
number of Itemsets
 4 items : 2^4 = 16
itemsets
 n items : 2^n itemsets
 As n becomes larger, the
problem can no longer be
solved in a reasonable time
All attempts are made to
reduce the number of
Item sets to be processed
“Improved” Naive
algorithm
 Ignore sets with zero
frequency
Item Sets Frequency
{Bread} 3
{Cheese } 3
{Juice} 2
{Milk} 2
{Bread, Cheese} 2
{Bread, Juice } 1
{Bread, Milk} 1
{Cheese, Juice} 2
{Cheese, Milk} 1
{Juice, Milk} 1
{Bread, Cheese, Juice} 1
{Bread, Cheese, Milk} 0
{Bread, Juice, Milk} 0
{Cheese, Juice, Milk} 1
{Bread, Cheese, Juice, Milk} 0
The APriori Algorithm
Consists of two PARTS
 First find the frequent itemsets
 Most of the cleverness happens here
 We will do better than the naive algorithm
 Find the rules
 This is relatively simpler
APriori : Part 1 - Frequent Sets
Step 1
 Scan all transactions and find all frequent items that have
support above p%. This is set L1
Step 2 : Apriori-Gen
 Build potential sets of k items from Lk-1 by using pairs of
itemsets in Lk-1 that have the first k-2 items in common and one
remaining item from each member of the pair.
 This is the candidate set Ck
Step 3 : Find Frequent Item Sets again
 Scan all transactions and find the frequency of the sets in Ck ;
those that are frequent give Lk
 If Lk is empty, stop ; else go back to Step 2
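The three steps can be sketched as a level-wise loop. One assumption to flag: instead of the prefix join described above, this sketch unions any two (k-1)-sets whose union has k items and then prunes; after the pruning step the surviving candidates are the same, the prefix join is just a faster way to generate them.

```python
from itertools import combinations

def apriori_frequent_sets(transactions, min_support):
    """Level-wise search for frequent itemsets: L1 -> C2 -> L2 -> ... until Lk is empty."""
    N = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Step 1: frequent individual items form L1
    items = set().union(*transactions)
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) / N >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Step 2 (Apriori-Gen): join frequent (k-1)-sets into k-item candidates
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be in L(k-1)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Step 3: one scan of the transactions keeps only the frequent candidates
        Lk = {c for c in candidates if count(c) / N >= min_support}
        frequent |= Lk
        k += 1
    return frequent

toy = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
       {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(sorted(sorted(s) for s in apriori_frequent_sets(toy, 0.5)))  # the 6 frequent sets
```

On the four-transaction toy data this returns exactly the six frequent sets found manually earlier.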
Example
We have 16 items spread over 25 transactions
Item No  Item Name
1        Biscuits
2        Bread
3        Cereal
4        Cheese
5        Chocolate
6        Coffee
7        Donuts
8        Eggs
9        Juice
10       Milk
11       Newspaper
12       Pastry
13       Rolls
14       Sugar
15       Tea
16       Yogurt

TID  Items
1    Biscuits, Bread, Cheese, Coffee, Yogurt
2    Bread, Cereal, Cheese, Coffee
3    Cheese, Chocolate, Donuts, Juice, Milk
4    Bread, Cheese, Coffee, Cereal, Juice
5    Bread, Cereal, Chocolate, Donuts, Juice
6    Milk, Tea
7    Biscuits, Bread, Cheese, Coffee, Milk
8    Eggs, Milk, Tea
9    Bread, Cereal, Cheese, Chocolate, Coffee
10   Bread, Cereal, Chocolate, Donuts, Juice
11   Bread, Cheese, Juice
12   Bread, Cheese, Coffee, Donuts, Juice
13   Biscuits, Bread, Cereal
14   Cereal, Cheese, Chocolate, Donuts, Juice
15   Chocolate, Coffee
16   Donuts
17   Donuts, Eggs, Juice
18   Biscuits, Bread, Cheese, Coffee
19   Bread, Cereal, Chocolate, Donuts, Juice
20   Cheese, Chocolate, Donuts, Juice
21   Milk, Tea, Yogurt
22   Bread, Cereal, Cheese, Coffee
23   Chocolate, Donuts, Juice, Milk, Newspaper
24   Newspaper, Pastry, Rolls
25   Rolls, Sugar, Tea
Apriori : Step 1 – Computing L1
Count frequency for each item and exclude
those that are below minimum support
Item No  Item Name  Frequency
1        Biscuits   4
2        Bread      13
3        Cereal     10
4        Cheese     11
5        Chocolate  9
6        Coffee     9
7        Donuts     10
8        Eggs       2
9        Juice      11
10       Milk       6
11       Newspaper  2
12       Pastry     1
13       Rolls      2
14       Sugar      1
15       Tea        4
16       Yogurt     2

With 25% support ( at least 7 of the 25 transactions ) this reduces to set L1 :

Item No  Item Name  Frequency
2        Bread      13
3        Cereal     10
4        Cheese     11
5        Chocolate  9
6        Coffee     9
7        Donuts     10
9        Juice      11
Step 2 : Computing C2
 Given L1, we now form the candidate pairs of C2. The 7 items
in L1 form 21 pairs : d(d-1)/2 – a quadratic function, not an
exponential one.

1   {Bread, Cereal}
2   {Bread, Cheese}
3   {Bread, Chocolate}
4   {Bread, Coffee}
5   {Bread, Donuts}
6   {Bread, Juice}
7   {Cereal, Cheese}
8   {Cereal, Coffee}
9   {Cereal, Chocolate}
10  {Cereal, Donuts}
11  {Cereal, Juice}
12  {Cheese, Chocolate}
13  {Cheese, Coffee}
14  {Cheese, Donuts}
15  {Cheese, Juice}
16  {Chocolate, Coffee}
17  {Chocolate, Donuts}
18  {Chocolate, Juice}
19  {Coffee, Donuts}
20  {Coffee, Juice}
21  {Donuts, Juice}
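The quadratic count is easy to confirm: for d frequent items there are exactly d(d-1)/2 unordered pairs.

```python
from itertools import combinations

# The 7 frequent items of L1 from the example
L1 = ["Bread", "Cereal", "Cheese", "Chocolate", "Coffee", "Donuts", "Juice"]
C2 = list(combinations(L1, 2))    # every unordered pair of frequent items
d = len(L1)
print(len(C2), d * (d - 1) // 2)  # 21 21 : growth is quadratic in d
```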
From C2 to L2 based on minimum support

Candidate 2-Item Set  Freq
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Chocolate}    4
{Bread, Coffee}       8
{Bread, Donuts}       4
{Bread, Juice}        6
{Cereal, Cheese}      5
{Cereal, Coffee}      4
{Cereal, Chocolate}   5
{Cereal, Donuts}      4
{Cereal, Juice}       6
{Cheese, Chocolate}   4
{Cheese, Coffee}      9
{Cheese, Donuts}      3
{Cheese, Juice}       4
{Chocolate, Coffee}   1
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Coffee, Donuts}      1
{Coffee, Juice}       2
{Donuts, Juice}       9

With 25% support this reduces to set L2 :

Frequent 2-Item Set   Freq
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9

 This is a computationally intensive step
 L2 is not empty, so we continue
Step 2 Again : Get C3
 We combine the appropriate frequent 2-item sets from L2
(which must have the same first item) and obtain four such
itemsets, each containing three items

This is set L2 :

Frequent 2-Item Set   Freq
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9

L2 to C3 gives the candidate 3-item sets :

{Bread, Cheese, Cereal}
{Bread, Cereal, Coffee}
{Bread, Cheese, Coffee}
{Chocolate, Donuts, Juice}
Step 3 Again : C3 to L3, again based on minimum support
 Since C4 cannot be formed, L4 cannot be formed, so we
stop here

Candidate 3-item set        Frequency
{Bread, Cheese, Cereal}     4
{Bread, Cereal, Coffee}     4
{Bread, Cheese, Coffee}     8
{Chocolate, Donuts, Juice}  7

With 25% support this reduces to set L3 :

Frequent 3-item set         Frequency
{Bread, Cheese, Coffee}     8
{Chocolate, Donuts, Juice}  7
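Why does the search stop here? The two sets in L3 share no common items, so the Apriori-Gen join (which needs the first k-2 = 2 items in common) produces no 4-item candidate. A minimal check:

```python
from itertools import combinations

# The two frequent 3-item sets of L3, written as sorted tuples
L3 = [("Bread", "Cheese", "Coffee"), ("Chocolate", "Donuts", "Juice")]

# Apriori-Gen for k = 4 joins two 3-item sets that agree on their first 2 items
C4 = [tuple(sorted(set(a) | set(b)))
      for a, b in combinations(L3, 2) if a[:2] == b[:2]]
print(C4)  # [] : no join is possible, so the search stops at L3
```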
APriori : Part 2 – Find Rules
Rules will be found by looking at
 3-item sets found in L3
 2-item sets in L2 that are not subsets of any set in L3
In each case we
 Calculate confidence (A ⇒ B )
 = P (B | A) = P(A ∩ B ) / P(A)
Some shorthand
 {Bread, Cheese, Coffee } is written as { B, C, D }
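Part 2 can be sketched as follows: every non-empty proper subset of a frequent itemset is tried as an antecedent, and the rule is kept when its confidence reaches the threshold. The function name `rules_from` and the 70% cutoff (the one used in the slides that follow) are my assumptions; the counts are taken from the tables above.

```python
from itertools import combinations

def rules_from(itemset, freq, min_conf):
    """All rules A => (itemset - A) whose confidence meets min_conf.
    `freq` maps frozensets to their transaction counts."""
    out = []
    s = frozenset(itemset)
    for k in range(1, len(s)):
        for ante in combinations(sorted(s), k):
            A = frozenset(ante)
            conf = freq[s] / freq[A]   # support(XY) / support(X)
            if conf >= min_conf:
                out.append((A, s - A, conf))
    return out

# Counts from the slides: {Bread, Cheese, Coffee} appears 8 times,
# Bread 13, Cheese 11, Coffee 9, and the 2-item subsets as tabulated in L2
freq = {
    frozenset({"Bread"}): 13, frozenset({"Cheese"}): 11, frozenset({"Coffee"}): 9,
    frozenset({"Bread", "Cheese"}): 8, frozenset({"Bread", "Coffee"}): 8,
    frozenset({"Cheese", "Coffee"}): 9,
    frozenset({"Bread", "Cheese", "Coffee"}): 8,
}
kept = rules_from({"Bread", "Cheese", "Coffee"}, freq, 0.70)
print(len(kept))  # 5 of the 6 rules survive; Bread => Cheese, Coffee (8/13) drops
```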
Rules for Finding Rules !
A 3 item frequent set { BCD} results in 6 rules
 B ⇒ CD, C ⇒ BD, D ⇒ BC
 CD ⇒ B, BD ⇒ C, BC ⇒ D
Also note that
 B ⇒ CD can also be written as
 B ⇒ D, B ⇒ C
We now look at these two 3-item sets and find
their confidence levels
 { Bread, Cheese, Coffee}
 { Chocolate, Donuts, Juice }
 From the L3 set ( the highest L set ) we note that the support
counts for these rules are 8 and 7
Rules from First of 2 Itemsets in L3
One rule drops out because its confidence is below 70%
 Calculate confidence (X ⇒ Y )
 = P (Y | X) = P(X ∩ Y ) / P(X)

Confidence of association rules from { Bread B, Cheese C, Coffee D } :

Rule      Support of BCD  Frequency of LHS  Confidence
B => CD   8               13                0.615
C => BD   8               11                0.727
D => BC   8                9                0.889
CD => B   8                9                0.889
BD => C   8                8                1.000
BC => D   8                8                1.000

(The LHS frequencies come from the earlier counts : Bread 13, Cheese 11,
Coffee 9, {Cheese, Coffee} 9, {Bread, Coffee} 8, {Bread, Cheese} 8.)
Rules from Second of 2 Itemsets in L3
One rule drops out because its confidence is below 70%

Confidence of association rules from { Chocolate N, Donuts M, Juice P } :

Rule      Support of NMP  Frequency of LHS  Confidence
N => MP   7                9                0.778
M => NP   7               10                0.700
P => NM   7               11                0.636
MP => N   7                9                0.778
NP => M   7                7                1.000
NM => P   7                7                1.000

(The LHS frequencies come from the earlier counts : Chocolate 9, Donuts 10,
Juice 11, {Donuts, Juice} 9, {Chocolate, Juice} 7, {Chocolate, Donuts} 7.)
Set of 14 Rules obtained from L3

C => BD :
 C => B    1  Cheese => Bread
 C => D    2  Cheese => Coffee
D => BC :
 D => B    3  Coffee => Bread
 D => C    4  Coffee => Cheese
CD => B    5  Cheese, Coffee => Bread
BD => C    6  Bread, Coffee => Cheese
BC => D    7  Bread, Cheese => Coffee
N => MP :
 N => M    8  Chocolate => Donuts
 N => P    9  Chocolate => Juice
M => NP :
 M => P   10  Donuts => Juice
 M => N   11  Donuts => Chocolate
MP => N   12  Donuts, Juice => Chocolate
NP => M   13  Chocolate, Juice => Donuts
NM => P   14  Chocolate, Donuts => Juice
What about L2 ?
Look for sets in L2 that are not subsets of any set in L3
 { Bread, Cereal } is the only candidate
 Which gives us two more rules
 Bread ⇒ Cereal
 Cereal ⇒ Bread

Frequent 2-Item Set   Freq
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9

Frequent 3-item set         Frequency
{Bread, Cheese, Coffee}     8
{Chocolate, Donuts, Juice}  7
Which are now added to get 16 rules

C => BD :
 C => B    1  Cheese => Bread
 C => D    2  Cheese => Coffee
D => BC :
 D => B    3  Coffee => Bread
 D => C    4  Coffee => Cheese
CD => B    5  Cheese, Coffee => Bread
BD => C    6  Bread, Coffee => Cheese
BC => D    7  Bread, Cheese => Coffee
N => MP :
 N => M    8  Chocolate => Donuts
 N => P    9  Chocolate => Juice
M => NP :
 M => P   10  Donuts => Juice
 M => N   11  Donuts => Chocolate
MP => N   12  Donuts, Juice => Chocolate
NP => M   13  Chocolate, Juice => Donuts
NM => P   14  Chocolate, Donuts => Juice
          15  Bread => Cereal
          16  Cereal => Bread
So where are we ?
Apriori Algorithm
Consists of two
PARTS
 First find the frequent
itemsets
 Most of the cleverness
happens here
 We will do better than
the naive algorithm
 Find the rules
 This is relatively simpler
We have just
completed the two
PARTS
Overall approach to
ARM is as follows
 List all itemsets
 Find frequency of each
 Identify “frequent sets”
 Based on support
 Search for Rules within
“frequent sets”
 Based on confidence
Naive Algorithm
 Exponential Time
Apriori Algorithm
 Polynomial Time
Observations
Actual values of support and confidence
 25%, 75% are very high values
 In reality one works with far smaller values
“Interestingness” of a rule
 Since X, Y are related events – not independent – hence
P(X ∩ Y) ≠ P(X)P(Y)
 Interestingness ≈ P(X ∩ Y) – P(X)P(Y)
Triviality of rules
 Rules involving very frequent items can be trivial
 You always buy potatoes when you go to the market and
so you can get rules that connect potatoes to many things
Inexplicable rules
 Toothbrush was the most frequent item on Tuesday ??
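Plugging the toy numbers for Juice ⇒ Cheese (from the four-transaction example) into the interestingness measure:

```python
# Interestingness of Juice => Cheese in the 4-transaction toy data:
# the rule is interesting only to the extent that P(X ∩ Y) exceeds
# what independence, P(X)·P(Y), would predict.
p_x, p_y, p_xy = 2 / 4, 3 / 4, 2 / 4   # Juice, Cheese, and both together
interestingness = p_xy - p_x * p_y
lift = (p_xy / p_x) / p_y              # P(Y | X) / P(Y)
print(interestingness)  # 0.125 : above the independence baseline
print(lift)             # 1.33... : Cheese is a third more likely given Juice
```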
Better Algorithms
Enhancements to
the Apriori
Algorithm
 AP-TID
 Direct Hashing and
Pruning (DHP)
 Dynamic Itemset
Counting (DIC)
Frequent Pattern (FP)
Tree
 Only frequent items are
needed to find association
rules – so ignore others !
 Move the data of only
frequent items to a more
compact and efficient
structure
 A Tree structure or a directed
graph is used
 Multiple transactions with
the same (frequent) items are
stored once, with a count
Software Support
KDNuggets.com
 Excellent collections of software available
Bart Goethals
 Free software for Apriori, FP-Tree
ARMiner
 GNU Open Source software from UMass/Boston
DMII
 National University of Singapore
DB2 Intelligent Data Miner
 IBM Corporation
 Equivalent software available from other vendors as well

Data mining arm-2009-v0

  • 1.
    Data Mining Association RulesMining or Market Basket Analysis Prithwis Mukerjee, Ph.D.
  • 2.
    Prithwis Mukerjee 2 Let usdescribe the problem ... A retailer sells the following items  And we assume that the shopkeeper keeps track of what each customer purchases :  He needs to know which items are generally sold together Bread Cheese Coffee Juice Milk Tea BiscuitsSugar Newspaper Items 10 Bread, Cheese, Newspaper 20 Bread, Cheese, Juice 30 Bread, Milk 40 Cheese, Juice, Milk, Coffee 50 Sugar, Tea, Coffee, Biscuits, Newspaper 60 Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper 70 Bread, Cheese 80 Bread, Cheese, Juice, Coffee 90 Bread, Milk 100 Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper Trans ID
  • 3.
    Prithwis Mukerjee 3 Associations Rules expressingrelations between items in a “Market Basket” { Sugar and Tea } => {Biscuits}  Is it true, that if a customer buys Sugar and Tea, she will also buy biscuits ?  If so, then  These items should be ordered together  But discounts should not be given on these items at the same time ! We can make a guess but  It would be better if we could structure this problem in terms of mathematics
  • 4.
    Prithwis Mukerjee 4 Basic Concepts Setof n Items on Sale  I = { i1 , i2 , i3 , i4 , i5 , i5 , ......, in } Transaction  A subset of I : T ⊆ I  A set of items purchased in an individual transaction  With each transaction having m items  ti = { i1 , i2 , i3 , i4 , i5 , i5 , ......, im } with m < n  If we have N transactions then we have t1 , t2 ,t3 ,.. tN as unique identifier for each transaction D is our total data about all N transactions  D = {t1 , t2 ,t3 ,.. tN }
  • 5.
    Prithwis Mukerjee 5 An AssociationRule Whenever X appears, Y also appears  X ⇒ Y  X ⊆ I, Y ⊆ I, X ∩ Y = ∅ X and Y may be  Single items or  Sets of items – in which the same item does not appear X is referred to as the antecedent Y is referred to as the consequent Whether a rule like this exists is the focus of our analysis
  • 6.
    Prithwis Mukerjee 6 Two keyconcepts Support ( or prevalence)  How often does X and Y appear together in the basket ?  If this number is very low then it is not worth examining  Expressed as a fraction of the total number of transactions  Say 10% or 0.1 Confidence ( or predictability )  Of all the occurances of X, in what fraction does Y also appear ?  Expressed as a fraction of all transactions containing X  Say 80% or 0.8 We are interested in rules that have a  Minimum value of support : say 25%  Minimum value of confidence : say 75%
  • 7.
    Prithwis Mukerjee 7 Mathematically speaking... Support (X)  = (Number of times X appears ) / N  = P(X) Support (XY)  = (Number of times X and Y appears ) / N  = P(X ∩ Y) Confidence (X ⇒ Y)  = Support (XY) / Support(X)  = Probability (X ∩ Y) / P(X)  = Conditional Probability P( Y | X) Lift : an optional term  Measures the power of association  P( Y | X) / P(Y)
  • 8.
    Prithwis Mukerjee 8 The taskat hand ... Given a large set of transactions, we seek a procedure ( or algorithm )  That will discover all association rules  That have a minimum support of p%  And a minimum confidence level of q%  And to do so in an efficient manner Algorithms  The Naive or Brute Force Method  The Improved Naive algorithm  The Apriori Algorithm  Improvements to the Apriori algorithm  FP ( Frequent Pattern ) Algorithm
  • 9.
    Prithwis Mukerjee 9 Let ustry the Naive Algorithm manually ! This is the set of transaction that we have ...  We want to find Association Rules with  Minimum 50% support and  Minimum 75% confidence Items 100 Bread, Cheese 200 Bread, Cheese, Juice 300 Bread, Milk 400 Cheese, Juice, Milk Trans ID
  • 10.
    Prithwis Mukerjee 10 Itemsets &Frequencies Which sets are frequent ?  Since we are looking for a support of 50%, we need a set to appear in 2 out of 4 transactions  = (# of times X appears ) / N  = P(X)  6 sets meet this criteria Item Sets Frequency {Bread} 3 {Cheese } 3 {Juice} 2 {Milk} 2 {Bread, Cheese} 2 {Bread, Juice } 1 {Bread, Milk} 1 {Cheese, Juice} 2 {Cheese, Milk} 1 {Juice, Milk} 1 {Bread, Cheese, Juice} 1 {Bread, Cheese, Milk} 0 {Bread, Juice, Milk} 0 {Cheese, Juice, Milk} 1 {Bread, Cheese, Juice, Milk} 0
  • 11.
    Prithwis Mukerjee 11 A closerlook at the “Frequent Set” Look at itemsets with more than 1 item  {Bread, Cheese}, {Cheese, Juice}  4 rules are possible Look for confidence levels  Confidence (X ⇒ Y)  = Support (XY) / Support(X) Item Sets Frequency Rule Confidence {Bread} 3 Bread => Cheese 2 / 3 67.00% {Cheese } 3 {Juice} 2 Cheese => Bread 2 / 3 67.00% {Milk} 2 {Bread, Cheese} 2 Cheese => Juice 2 / 3 67.00% {Cheese, Juice} 2 Juice => Cheese 2 / 2 100.00%
  • 12.
    Prithwis Mukerjee 12 A closerlook at the “Frequent Set” Look at itemsets with more than 1 item  {Bread, Cheese}, {Cheese, Juice}  4 rules are possible Look for confidence levels  Confidence (X ⇒ Y)  = Support (XY) / Support(X) Item Sets Frequency Rule Confidence {Bread} 3 Bread => Cheese 2 / 3 67.00% {Cheese } 3 {Juice} 2 Cheese => Bread 2 / 3 67.00% {Milk} 2 {Bread, Cheese} 2 Cheese => Juice 2 / 3 67.00% {Cheese, Juice} 2 Juice => Cheese 2 / 2 100.00%
  • 13.
    Prithwis Mukerjee 13 The BigPicture List all itemsets  Find frequency of each Identify “frequent sets”  Based on support Search for Rules within “frequent sets”  Based on confidence
  • 14.
    Prithwis Mukerjee 14 Looking Beyondthe Retail Store Counter Terrorism  Track phone calls made or received from a particular number every day  Is an incoming call from a particular number followed by a call to another number ?  Are there any sets of numbers that are always called together ? Expand the item sets to include  Electronic fund transfers  Travel between two locations  Boarding cards  Railway reservation All data is available in electronic format
  • 15.
    Prithwis Mukerjee 15 Major Problem ExponentialGrowth of number of Itemsets  4 items : 16 = 24 members  n items : 2n members  As n becomes larger, the problem cannot be solved anymore in finite time All attempts are made to reduce the number of Item sets to be processed “Improved” Naive algorithm  Ignore sets with zero frequency Item Sets Frequency {Bread} 3 {Cheese } 3 {Juice} 2 {Milk} 2 {Bread, Cheese} 2 {Bread, Juice } 1 {Bread, Milk} 1 {Cheese, Juice} 2 {Cheese, Milk} 1 {Juice, Milk} 1 {Bread, Cheese, Juice} 1 {Bread, Cheese, Milk} 0 {Bread, Juice, Milk} 0 {Cheese, Juice, Milk} 1 {Bread, Cheese, Juice, Milk} 0
  • 16.
    Prithwis Mukerjee 16 The APrioriAlgorithm Consists of two PARTS  First find the frequent itemsets  Most of the cleverness happens here  We will do better than the naive algorithm  Find the rules  This is relatively simpler
  • 17.
    Prithwis Mukerjee 17 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p%. This is set L1 Step 2 : Apriori-Gen  Build potential sets of k items from the Lk-1 by using pairs of itemsets in Lk-1 that has the first k-2 items common and one remaining item from each member of the pair.  This is Candidate set CK Step 3 : Find Frequent Item Sets again  Scan all transactions and find frequency of sets in CK that are frequent : This gives LK  If LK is empty, stop, else go back to step 2
  • 18.
    Prithwis Mukerjee 18 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p% - This is set L1
  • 19.
    Prithwis Mukerjee 19 Example We have16 items spread over 25 transactions Item No Item Name 1 Biscuits 2 Bread 3 Cereal 4 Cheese 5 Chocolate 6 Coffee 7 8 Eggs 9 Juice 10 Milk 11 Newspaper 12 Pastry 13 Rolls 14 Sugar 15 Tea 16 Donuts Yogurt TID Items 1 2 Bread, Cereal, Cheese, Coffee 3 4 Bread, Cheese, Coffee, Cereal, Juice 5 6 Milk, Tea 7 Biscuits, Bread, Cheese, Coffee, Milk 8 Eggs, Milk, Tea 9 Bread, Cereal, Cheese, Chocolate, Coffee 10 11 Bread, Cheese, Juice 12 13 Biscuits, Bread, Cereal 14 15 Chocolate, Coffee 16 17 18 Biscuits, Bread, Cheese, Coffee 19 20 21 22 Bread, Cereal, Cheese, Coffee 23 24 Newspaper, Pastry, Rolls 25 Rolls, Sugar, Tea Biscuits, Bread, Cheese, Coffee, Yogurt Cheese, Chocolate, Donuts, Juice, Milk Bread, Cereal, Chocolate, Donuts, Juice Bread, Cereal, Chocolate, Donuts, Juice Bread, Cheese, Coffee, Donuts, Juice Cereal, Cheese, Chocolate, Donuts, Juice Donuts Donuts, Eggs, Juice Bread, Cereal, Chocolate, Donuts, Juice Cheese, Chocolate, Donuts, Juice Milk, Tea, Yogurt Chocolate, Donuts, Juice, Milk, Newspaper
  • 20.
    Prithwis Mukerjee 20 Apriori :Step 1 – Computing L1 Count frequency for each item and exclude those that are below minimum support Item No Item Name Frequency 1 Biscuits 4 2 Bread 13 3 Cereal 10 4 Cheese 11 5 Chocolate 9 6 Coffee 9 7 10 8 Eggs 2 9 Juice 11 10 Milk 6 11 Newspaper 2 12 Pastry 1 13 Rolls 2 14 Sugar 1 15 Tea 4 16 2 Donuts Yogurt Item No Item Name Frequency 2 Bread 13 3 Cereal 10 4 Cheese 11 5 Chocolate 9 6 Coffee 9 7 10 9 Juice 11 Donuts 25% support 25% support This is set L1
  • 21.
    Prithwis Mukerjee 21 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p%. This is set L1 Step 2 : Apriori-Gen  Build potential sets of k items from the Lk-1 by using pairs of itemsets in Lk-1 that has the first k-2 items common and one remaining item from each member of the pair.  This is Candidate set CK
  • 22.
    Prithwis Mukerjee 22 Step 2: Computing C2  Given L1 , we now form candidate pairs of C2 . The 7 items in form 21 pairs : d*(d-1)/2 – this is a quadratic function and not a exponential function. 1 {Bread, Cereal} 2 {Bread, Cheese} 3 {Bread, Chocolate} 4 {Bread, Coffee} 5 6 {Bread,Juice} 7 {Cereal, Cheese} 8 {Cereal, Coffee} 9 {Cereal, Chocolate} 10 11 {Cereal, Juice} 12 {Cheese, Chocolate} 13 {Cheese, Coffee} 14 15 {Cheese, Juice} 16 {Chocolate, Coffee} 17 18 {Chocolate, Juice} 19 20 {Coffee, Juice} 21 {Bread, Donuts} {Cereal, Donuts} {Cheese, Donuts} {Chocolate, Donuts} {Coffee, Donuts} {Donuts, Juice} Item No Item Name Frequency 2 Bread 13 3 Cereal 10 4 Cheese 11 5 Chocolate 9 6 Coffee 9 7 10 9 Juice 11 Donuts L1 to C2 L1 to C2
  • 23.
    Prithwis Mukerjee 23 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p%. This is set L1 Step 2 : Apriori-Gen  Build potential sets of k items from the Lk-1 by using pairs of itemsets in Lk-1 that has the first k-2 items common and one remaining item from each member of the pair.  This is Candidate set CK Step 3 : Find Frequent Item Sets again  Scan all transactions and find frequency of sets in CK that are frequent : This gives LK  If LK is empty, stop, else go back to step 2
  • 24.
    Prithwis Mukerjee 24 From C2 toL2 based on minimum support Candidate 2-Item Set Freq {Bread, Cereal} 9 {Bread, Cheese} 8 {Bread, Chocolate} 4 {Bread, Coffee} 8 4 {Bread,Juice} 6 {Cereal, Cheese} 5 {Cereal, Coffee} 4 {Cereal, Chocolate} 5 4 {Cereal, Juice} 6 {Cheese, Chocolate} 4 {Cheese, Coffee} 9 3 {Cheese, Juice} 4 {Chocolate, Coffee} 1 7 {Chocolate, Juice} 7 1 {Coffee, Juice} 2 9 {Bread, Donuts} {Cereal, Donuts} {Cheese, Donuts} {Chocolate, Donuts} {Coffee, Donuts} {Donuts, Juice} Frequent 2-Item Set Freq {Bread, Cereal} 9 {Bread, Cheese} 8 {Bread, Coffee} 8 {Cheese, Coffee} 9 7 {Chocolate, Juice} 7 9 {Chocolate, Donuts} {Donuts, Juice} 25% support 25% support  This is a computationally intensive step  L2 is not empty This is set L2
  • 25.
    Prithwis Mukerjee 25 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p%. This is set L1 Step 2 : Apriori-Gen  Build potential sets of k items from the Lk-1 by using pairs of itemsets in Lk-1 that has the first k-2 items common and one remaining item from each member of the pair.  This is Candidate set CK Step 3 : Find Frequent Item Sets again  Scan all transactions and find frequency of sets in CK that are frequent : This gives LK  If LK is empty, stop, else go back to step 2
  • 26.
    Prithwis Mukerjee 26 Step 2Again : Get C3  We combine the appropriate frequent 2-item sets from L2 (which must have the same first item) and obtain four such itemsets each containing three items Frequent 2-Item Set Freq {Bread, Cereal} 9 {Bread, Cheese} 8 {Bread, Coffee} 8 {Cheese, Coffee} 9 7 {Chocolate, Juice} 7 9 {Chocolate, Donuts} {Donuts, Juice} This is set L2 Candidate 3 item set {Bread, Cheese, Cereal} {Bread, Cereal, Coffee} {Bread, Cheese, Coffee} {Chocolate, Donut, Juice} L2 to C3 L2 to C3
Prithwis Mukerjee 27 Step 3 Again: C3 to L3, Again Based on Minimum Support
Candidate 3-item set : Frequency
- {Bread, Cheese, Cereal} 4
- {Bread, Cereal, Coffee} 4
- {Bread, Cheese, Coffee} 8
- {Chocolate, Donuts, Juice} 7
At 25% support, the frequent 3-item sets (L3) are:
- {Bread, Cheese, Coffee} 8
- {Chocolate, Donuts, Juice} 7
- Since C4 cannot be formed, L4 cannot be formed, so we stop here.
Prithwis Mukerjee 29 The Apriori Algorithm Consists of Two Parts
- First find the frequent itemsets
- Most of the cleverness happens here
- We do better than the naive algorithm
- Then find the rules
- This is relatively simpler
Prithwis Mukerjee 30 Apriori: Part 2 - Find the Rules
Rules will be found by looking at
- 3-item sets found in L3
- 2-item sets in L2 that are not subsets of any set in L3
In each case we
- Calculate confidence (A ⇒ B) = P(B | A) = P(A ∩ B) / P(A)
Some shorthand
- {Bread, Cheese, Coffee} is written as {B, C, D}
Prithwis Mukerjee 31 Rules for Finding Rules!
A 3-item frequent set {BCD} yields 6 candidate rules
- B ⇒ CD, C ⇒ BD, D ⇒ BC
- CD ⇒ B, BD ⇒ C, BC ⇒ D
Also note that
- B ⇒ CD implies the simpler rules B ⇒ C and B ⇒ D
We now look at the two 3-item sets from L3 (the highest L set) and find the confidence of their rules
- {Bread, Cheese, Coffee}, with support 8
- {Chocolate, Donuts, Juice}, with support 7
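Enumerating the candidate rules and filtering them by confidence can be sketched as follows. This is an illustration, not the deck's code; `counts` is assumed to map each frequent itemset (as a frozenset) to its frequency, which is exactly what Part 1 produces.

```python
from itertools import combinations

def rules_from_itemset(itemset, counts, min_conf):
    """Enumerate every rule A => B with A, B non-empty and A ∪ B = itemset,
    keeping those whose confidence = count(itemset) / count(A) >= min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):            # size of the antecedent A
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = counts[itemset] / counts[lhs]
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, conf))
    return rules
```

For a 3-item set this enumerates the 6 candidate rules from the slide (three with a single-item antecedent, three with a two-item antecedent).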
Prithwis Mukerjee 32 Rules from First of 2 Itemsets in L3
One rule drops out because its confidence is < 70%
- Confidence (X ⇒ Y) = P(Y | X) = P(X ∩ Y) / P(X)
Confidence of association rules from {Bread, Cheese, Coffee}
(Rule : Support of BCD / Frequency of LHS = Confidence)
- B ⇒ CD : 8 / 13 = 0.615
- C ⇒ BD : 8 / 11 = 0.727
- D ⇒ BC : 8 / 9 = 0.889
- CD ⇒ B : 8 / 9 = 0.889
- BD ⇒ C : 8 / 8 = 1.000
- BC ⇒ D : 8 / 8 = 1.000
Item frequencies: 1 Biscuits 4, 2 Bread 13, 3 Cereal 10, 4 Cheese 11, 5 Chocolate 9, 6 Coffee 9, 7 Donuts 10, 8 Eggs 2, 9 Juice 11, 10 Milk 6, 11 Newspaper 2, 12 Pastry 1, 13 Rolls 2, 14 Sugar 1, 15 Tea 4, 16 Yogurt 2
Prithwis Mukerjee 33 Rules from First of 2 Itemsets in L3
One rule drops out because its confidence is < 70%
Confidence of association rules from {Bread B, Cheese C, Coffee D}
(Rule : Support of BCD / Frequency of LHS = Confidence)
- B ⇒ CD : 8 / 13 = 0.615
- C ⇒ BD : 8 / 11 = 0.727
- D ⇒ BC : 8 / 9 = 0.889
- CD ⇒ B : 8 / 9 = 0.889
- BD ⇒ C : 8 / 8 = 1.000
- BC ⇒ D : 8 / 8 = 1.000
Frequent 2-Item Set : Freq
- {Bread, Cereal} 9, {Bread, Cheese} 8, {Bread, Coffee} 8, {Cheese, Coffee} 9, {Chocolate, Donuts} 7, {Chocolate, Juice} 7, {Donuts, Juice} 9
Prithwis Mukerjee 34 Rules from Second of 2 Itemsets in L3
One rule drops out because its confidence is < 70%
Confidence of association rules from {Chocolate N, Donuts M, Juice P}
(Rule : Support of NMP / Frequency of LHS = Confidence)
- N ⇒ MP : 7 / 9 = 0.778
- M ⇒ NP : 7 / 10 = 0.700
- P ⇒ NM : 7 / 11 = 0.636
- MP ⇒ N : 7 / 9 = 0.778
- NP ⇒ M : 7 / 7 = 1.000
- NM ⇒ P : 7 / 7 = 1.000
Prithwis Mukerjee 35 Rules from Second of 2 Itemsets in L3
One rule drops out because its confidence is < 70%
Confidence of association rules from {Chocolate N, Donuts M, Juice P}
(Rule : Support of NMP / Frequency of LHS = Confidence)
- N ⇒ MP : 7 / 9 = 0.778
- M ⇒ NP : 7 / 10 = 0.700
- P ⇒ NM : 7 / 11 = 0.636
- MP ⇒ N : 7 / 9 = 0.778
- NP ⇒ M : 7 / 7 = 1.000
- NM ⇒ P : 7 / 7 = 1.000
Frequent 2-Item Set : Freq
- {Bread, Cereal} 9, {Bread, Cheese} 8, {Bread, Coffee} 8, {Cheese, Coffee} 9, {Chocolate, Donuts} 7, {Chocolate, Juice} 7, {Donuts, Juice} 9
Prithwis Mukerjee 36 Set of 14 Rules Obtained from L3
From C ⇒ BD:
- 1 Cheese ⇒ Bread (C ⇒ B)
- 2 Cheese ⇒ Coffee (C ⇒ D)
From D ⇒ BC:
- 3 Coffee ⇒ Bread (D ⇒ B)
- 4 Coffee ⇒ Cheese (D ⇒ C)
- 5 Cheese, Coffee ⇒ Bread (CD ⇒ B)
- 6 Bread, Coffee ⇒ Cheese (BD ⇒ C)
- 7 Bread, Cheese ⇒ Coffee (BC ⇒ D)
From N ⇒ MP:
- 8 Chocolate ⇒ Donuts (N ⇒ M)
- 9 Chocolate ⇒ Juice (N ⇒ P)
From M ⇒ NP:
- 10 Donuts ⇒ Juice (M ⇒ P)
- 11 Donuts ⇒ Chocolate (M ⇒ N)
- 12 Donuts, Juice ⇒ Chocolate (MP ⇒ N)
- 13 Chocolate, Juice ⇒ Donuts (NP ⇒ M)
- 14 Chocolate, Donuts ⇒ Juice (NM ⇒ P)
Prithwis Mukerjee 37 What about L2?
Look for sets in L2 that are not subsets of any set in L3
- {Bread, Cereal} is the only candidate
- Which gives us two more rules
- Bread ⇒ Cereal (note: its confidence is 9/13 ≈ 0.69, marginally below the 70% cut-off)
- Cereal ⇒ Bread
Frequent 2-Item Set : Freq
- {Bread, Cereal} 9, {Bread, Cheese} 8, {Bread, Coffee} 8, {Cheese, Coffee} 9, {Chocolate, Donuts} 7, {Chocolate, Juice} 7, {Donuts, Juice} 9
Frequent 3-item set : Frequency
- {Bread, Cheese, Coffee} 8
- {Chocolate, Donuts, Juice} 7
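Selecting the L2 sets that are not covered by any L3 set is a one-line filter. A small sketch (illustrative, not the deck's code):

```python
def uncovered_2_sets(L2, L3):
    """Return the 2-item sets in L2 that are not subsets of any
    frequent 3-item set in L3; only these yield new rules."""
    return [s for s in L2 if not any(set(s) <= set(t) for t in L3)]
```

With the L2 and L3 sets from the slides, only {Bread, Cereal} survives, as stated above.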
Prithwis Mukerjee 38 Which Are Now Added to Get 16 Rules
From C ⇒ BD:
- 1 Cheese ⇒ Bread (C ⇒ B)
- 2 Cheese ⇒ Coffee (C ⇒ D)
From D ⇒ BC:
- 3 Coffee ⇒ Bread (D ⇒ B)
- 4 Coffee ⇒ Cheese (D ⇒ C)
- 5 Cheese, Coffee ⇒ Bread (CD ⇒ B)
- 6 Bread, Coffee ⇒ Cheese (BD ⇒ C)
- 7 Bread, Cheese ⇒ Coffee (BC ⇒ D)
From N ⇒ MP:
- 8 Chocolate ⇒ Donuts (N ⇒ M)
- 9 Chocolate ⇒ Juice (N ⇒ P)
From M ⇒ NP:
- 10 Donuts ⇒ Juice (M ⇒ P)
- 11 Donuts ⇒ Chocolate (M ⇒ N)
- 12 Donuts, Juice ⇒ Chocolate (MP ⇒ N)
- 13 Chocolate, Juice ⇒ Donuts (NP ⇒ M)
- 14 Chocolate, Donuts ⇒ Juice (NM ⇒ P)
From L2:
- 15 Bread ⇒ Cereal
- 16 Cereal ⇒ Bread
Prithwis Mukerjee 39 So Where Are We?
The Apriori algorithm consists of two parts
- First find the frequent itemsets
- Most of the cleverness happens here
- We do better than the naive algorithm
- Then find the rules
- This is relatively simpler
We have just completed the two parts
The overall approach to ARM is as follows
- List all itemsets
- Find the frequency of each
- Identify the frequent sets, based on support
- Search for rules within the frequent sets, based on confidence
Naive algorithm: exponential time
Apriori algorithm: polynomial time
Prithwis Mukerjee 40 Observations
Actual values of support and confidence
- 25% support and 70% confidence are very high values
- In reality one works with far smaller values
"Interestingness" of a rule
- Since X and Y are related events, not independent, P(X ∩ Y) ≠ P(X)P(Y)
- Interestingness ≈ P(X ∩ Y) - P(X)P(Y)
Triviality of rules
- Rules involving very frequent items can be trivial
- If you always buy potatoes when you go to the market, you will get rules that connect potatoes to many other things
Inexplicable rules
- Toothbrush was the most frequent item on Tuesday??
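The interestingness measure above (sometimes called leverage) is straightforward to compute from the itemset counts. A minimal sketch; note that the total number of transactions used in the test is an assumption, since the slides give item counts but never state the total explicitly:

```python
def interestingness(freq_xy, freq_x, freq_y, n):
    """P(X ∩ Y) - P(X)·P(Y): zero when X and Y are independent,
    large when they co-occur more often than chance predicts."""
    return freq_xy / n - (freq_x / n) * (freq_y / n)
```

A rule over very frequent items can have high confidence yet an interestingness score near zero, which is exactly the triviality problem described above.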
Prithwis Mukerjee 41 Better Algorithms
Enhancements to the Apriori algorithm
- AprioriTid (AP-TID)
- Direct Hashing and Pruning (DHP)
- Dynamic Itemset Counting (DIC)
Frequent Pattern (FP) Tree
- Only frequent items are needed to find association rules, so ignore the others!
- Move the data for only the frequent items into a more compact and efficient structure
- A tree structure or a directed graph is used
- Multiple transactions with the same (frequent) items are stored once, with a count
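The compaction idea behind the FP-tree, storing each distinct frequent-item projection only once with a count, can be illustrated without building an actual tree. This sketch uses a flat table keyed by the projection, which captures the storage saving but not the tree's shared-prefix structure:

```python
from collections import Counter

def compact(transactions, frequent_items):
    """Drop infrequent items and store each distinct projection once,
    with a count -- a flat-table illustration of FP-tree compaction."""
    table = Counter()
    for t in transactions:
        key = tuple(sorted(i for i in t if i in frequent_items))
        if key:  # skip transactions with no frequent items at all
            table[key] += 1
    return table
```

Identical baskets (after dropping infrequent items) collapse into a single entry, so repeated purchase patterns cost no extra storage.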
Prithwis Mukerjee 42 Software Support
KDNuggets.com
- Excellent collections of software available
Bart Goethals
- Free software for Apriori and FP-Tree
ARMiner
- GNU open-source software from UMass/Boston
DMII
- National University of Singapore
DB2 Intelligent Data Miner
- IBM Corporation
- Equivalent software is available from other vendors as well