Data Mining
Association Rules Mining or
Market Basket Analysis
Prithwis Mukerjee, Ph.D.
Let us describe the problem ...
A retailer sells the following items:
 Bread, Cheese, Coffee, Juice, Milk, Tea, Biscuits, Sugar, Newspaper
 And we assume that the shopkeeper keeps track of what
each customer purchases :
 He needs to know which items are generally sold together

Trans ID  Items
10        Bread, Cheese, Newspaper
20        Bread, Cheese, Juice
30        Bread, Milk
40        Cheese, Juice, Milk, Coffee
50        Sugar, Tea, Coffee, Biscuits, Newspaper
60        Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70        Bread, Cheese
80        Bread, Cheese, Juice, Coffee
90        Bread, Milk
100       Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper
Associations
Rules expressing relations between items in a
“Market Basket”
{ Sugar and Tea } => {Biscuits}
 Is it true that if a customer buys Sugar and Tea, she will
also buy Biscuits ?
 If so, then
 These items should be ordered together
 But discounts should not be given on these items at the same
time !
We can make a guess but
 It would be better if we could structure this problem in
terms of mathematics
Basic Concepts
Set of n Items on Sale
 I = { i1, i2, i3, ......, in }
Transaction
 A subset of I : T ⊆ I
 A set of items purchased in an individual transaction
 With each transaction having m items
 ti = { i1, i2, i3, ......, im } with m < n
 If we have N transactions then t1, t2, t3, .. tN serve as
unique identifiers for the transactions
D is our total data about all N transactions
 D = { t1, t2, t3, .. tN }
An Association Rule
Whenever X appears, Y also appears
 X ⇒ Y
 X ⊆ I, Y ⊆ I, X ∩ Y = ∅
X and Y may be
 Single items or
 Sets of items – in which the same item does not appear
X is referred to as the antecedent
Y is referred to as the consequent
Whether a rule like this exists is the focus of
our analysis
Two key concepts
Support ( or prevalence)
 How often do X and Y appear together in the basket ?
 If this number is very low then it is not worth examining
 Expressed as a fraction of the total number of transactions
 Say 10% or 0.1
Confidence ( or predictability )
 Of all the occurrences of X, in what fraction does Y also
appear ?
 Expressed as a fraction of all transactions containing X
 Say 80% or 0.8
We are interested in rules that have a
 Minimum value of support : say 25%
 Minimum value of confidence : say 75%
Mathematically speaking ...
Support (X)
 = (Number of times X appears ) / N
 = P(X)
Support (XY)
 = (Number of times X and Y appear together ) / N
 = P(X ∩ Y)
Confidence (X ⇒ Y)
 = Support (XY) / Support(X)
 = P(X ∩ Y) / P(X)
 = Conditional Probability P( Y | X)
Lift : an optional term
 Measures the power of association
 P( Y | X) / P(Y)
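As a sanity check, these three measures can be computed directly on the retailer's ten transactions from the opening example. This is a minimal sketch; the helper names `support`, `confidence` and `lift` are my own, not from the slides:

```python
# Transactions transcribed from the retailer's table (Trans IDs 10..100)
transactions = [
    {"Bread", "Cheese", "Newspaper"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk", "Coffee"},
    {"Sugar", "Tea", "Coffee", "Biscuits", "Newspaper"},
    {"Sugar", "Tea", "Coffee", "Biscuits", "Milk", "Juice", "Newspaper"},
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice", "Coffee"},
    {"Bread", "Milk"},
    {"Sugar", "Tea", "Coffee", "Bread", "Milk", "Juice", "Newspaper"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """P(Y | X) = support(X ∪ Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Confidence divided by the baseline probability of Y."""
    return confidence(X, Y, transactions) / support(Y, transactions)

X, Y = {"Sugar", "Tea"}, {"Biscuits"}
print(support(X | Y, transactions))    # 0.2 : the rule appears in 2 of 10 baskets
print(confidence(X, Y, transactions))  # 2/3 : Biscuits follow Sugar+Tea in 2 of 3 cases
```

So { Sugar, Tea } ⇒ { Biscuits } has 20% support and 67% confidence on this data.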
The task at hand ...
Given a large set of transactions, we seek a
procedure ( or algorithm )
 That will discover all association rules
 That have a minimum support of p%
 And a minimum confidence level of q%
 And to do so in an efficient manner
Algorithms
 The Naive or Brute Force Method
 The Improved Naive algorithm
 The Apriori Algorithm
 Improvements to the Apriori algorithm
 FP ( Frequent Pattern ) Algorithm
Let us try the Naive Algorithm manually !
This is the set of transactions that we have ...
 We want to find Association Rules with
 Minimum 50% support and
 Minimum 75% confidence

Trans ID  Items
100       Bread, Cheese
200       Bread, Cheese, Juice
300       Bread, Milk
400       Cheese, Juice, Milk
Itemsets & Frequencies
Which sets are frequent ?
 Since we are looking for a
support of 50%, we need a
set to appear in 2 out of 4
transactions
 = (# of times X appears ) / N
 = P(X)
 6 sets meet this criterion
Item Sets Frequency
{Bread} 3
{Cheese } 3
{Juice} 2
{Milk} 2
{Bread, Cheese} 2
{Bread, Juice } 1
{Bread, Milk} 1
{Cheese, Juice} 2
{Cheese, Milk} 1
{Juice, Milk} 1
{Bread, Cheese, Juice} 1
{Bread, Cheese, Milk} 0
{Bread, Juice, Milk} 0
{Cheese, Juice, Milk} 1
{Bread, Cheese, Juice, Milk} 0
A closer look at the “Frequent Set”
Look at itemsets with more than 1 item
 {Bread, Cheese}, {Cheese, Juice}
 4 rules are possible
Look for confidence levels
 Confidence (X ⇒ Y)
 = Support (XY) / Support(X)
Item Sets        Frequency
{Bread}          3
{Cheese}         3
{Juice}          2
{Milk}           2
{Bread, Cheese}  2
{Cheese, Juice}  2

Rule             Confidence
Bread => Cheese  2 / 3   67.00%
Cheese => Bread  2 / 3   67.00%
Cheese => Juice  2 / 3   67.00%
Juice => Cheese  2 / 2  100.00%
The Big Picture
List all itemsets
 Find frequency of each
Identify “frequent sets”
 Based on support
Search for Rules within “frequent sets”
 Based on confidence
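The big picture above can be sketched as a brute-force (naive) program on the four-transaction example; all names here are illustrative:

```python
from itertools import chain, combinations

# The four transactions of the worked example
transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
items = sorted(set().union(*transactions))
N = len(transactions)

# Step 1: list ALL 2^n - 1 non-empty itemsets and count each one
itemsets = chain.from_iterable(combinations(items, k) for k in range(1, len(items) + 1))
freq = {s: sum(1 for t in transactions if set(s) <= t) for s in itemsets}

# Step 2: keep the "frequent sets" (support >= 50%, i.e. 2 of 4 transactions)
frequent = {s: f for s, f in freq.items() if f / N >= 0.5}

# Step 3: search for rules inside the frequent sets (confidence >= 75%)
rules = []
for s, f in frequent.items():
    for k in range(1, len(s)):
        for ante in combinations(s, k):
            if f / freq[ante] >= 0.75:
                rules.append((ante, tuple(i for i in s if i not in ante)))

print(len(frequent))  # 6 frequent sets, as on the slide
print(rules)          # only Juice => Cheese reaches 75% confidence
```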
Looking Beyond the Retail Store
Counter Terrorism
 Track phone calls made
or received from a
particular number every
day
 Is an incoming call from a
particular number
followed by a call to
another number ?
 Are there any sets of
numbers that are always
called together ?
Expand the item sets
to include
 Electronic fund transfers
 Travel between two
locations
 Boarding cards
 Railway reservation
All data is available
in electronic format
Major Problem
Exponential Growth of
number of Itemsets
 4 items : 2^4 = 16
itemsets
 n items : 2^n itemsets
 As n becomes larger, the
problem can no longer be
solved in a reasonable time
All attempts are made to
reduce the number of
Item sets to be processed
“Improved” Naive
algorithm
 Ignore sets with zero
frequency
Item Sets Frequency
{Bread} 3
{Cheese } 3
{Juice} 2
{Milk} 2
{Bread, Cheese} 2
{Bread, Juice } 1
{Bread, Milk} 1
{Cheese, Juice} 2
{Cheese, Milk} 1
{Juice, Milk} 1
{Bread, Cheese, Juice} 1
{Bread, Cheese, Milk} 0
{Bread, Juice, Milk} 0
{Cheese, Juice, Milk} 1
{Bread, Cheese, Juice, Milk} 0
The APriori Algorithm
Consists of two PARTS
 First find the frequent itemsets
 Most of the cleverness happens here
 We will do better than the naive algorithm
 Find the rules
 This is relatively simpler
APriori : Part 1 - Frequent Sets
Step 1
 Scan all transactions and find all frequent items that have
support above p%. This is set L1
Step 2 : Apriori-Gen
 Build potential sets of k items from Lk-1 by using pairs of
itemsets in Lk-1 that have the first k-2 items in common and one
remaining item from each member of the pair.
 This is the candidate set Ck
Step 3 : Find Frequent Item Sets again
 Scan all transactions and find the frequency of the sets in Ck ;
those that are frequent give Lk
 If Lk is empty, stop ; else go back to Step 2
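The three steps can be sketched as a level-wise loop. One assumption to flag: instead of the prefix join described above, this sketch unions any two (k-1)-sets whose union has k items and then prunes; after the pruning step the surviving candidates are the same, the prefix join is just a faster way to generate them.

```python
from itertools import combinations

def apriori_frequent_sets(transactions, min_support):
    """Level-wise search for frequent itemsets: L1 -> C2 -> L2 -> ... until Lk is empty."""
    N = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Step 1: frequent individual items form L1
    items = set().union(*transactions)
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) / N >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Step 2 (Apriori-Gen): join frequent (k-1)-sets into k-item candidates
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be in L(k-1)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Step 3: one scan of the transactions keeps only the frequent candidates
        Lk = {c for c in candidates if count(c) / N >= min_support}
        frequent |= Lk
        k += 1
    return frequent

toy = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
       {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(sorted(sorted(s) for s in apriori_frequent_sets(toy, 0.5)))  # the 6 frequent sets
```

On the four-transaction toy data this returns exactly the six frequent sets found manually earlier.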
Example
We have 16 items spread over 25 transactions
Item No  Item Name
1        Biscuits
2        Bread
3        Cereal
4        Cheese
5        Chocolate
6        Coffee
7        Donuts
8        Eggs
9        Juice
10       Milk
11       Newspaper
12       Pastry
13       Rolls
14       Sugar
15       Tea
16       Yogurt

TID  Items
1    Biscuits, Bread, Cheese, Coffee, Yogurt
2    Bread, Cereal, Cheese, Coffee
3    Cheese, Chocolate, Donuts, Juice, Milk
4    Bread, Cheese, Coffee, Cereal, Juice
5    Bread, Cereal, Chocolate, Donuts, Juice
6    Milk, Tea
7    Biscuits, Bread, Cheese, Coffee, Milk
8    Eggs, Milk, Tea
9    Bread, Cereal, Cheese, Chocolate, Coffee
10   Bread, Cereal, Chocolate, Donuts, Juice
11   Bread, Cheese, Juice
12   Bread, Cheese, Coffee, Donuts, Juice
13   Biscuits, Bread, Cereal
14   Cereal, Cheese, Chocolate, Donuts, Juice
15   Chocolate, Coffee
16   Donuts
17   Donuts, Eggs, Juice
18   Biscuits, Bread, Cheese, Coffee
19   Bread, Cereal, Chocolate, Donuts, Juice
20   Cheese, Chocolate, Donuts, Juice
21   Milk, Tea, Yogurt
22   Bread, Cereal, Cheese, Coffee
23   Chocolate, Donuts, Juice, Milk, Newspaper
24   Newspaper, Pastry, Rolls
25   Rolls, Sugar, Tea
Apriori : Step 1 – Computing L1
Count frequency for each item and exclude
those that are below minimum support
Item No  Item Name  Frequency
1        Biscuits   4
2        Bread      13
3        Cereal     10
4        Cheese     11
5        Chocolate  9
6        Coffee     9
7        Donuts     10
8        Eggs       2
9        Juice      11
10       Milk       6
11       Newspaper  2
12       Pastry     1
13       Rolls      2
14       Sugar      1
15       Tea        4
16       Yogurt     2

With 25% support ( at least 7 of the 25 transactions ) this reduces to set L1 :

Item No  Item Name  Frequency
2        Bread      13
3        Cereal     10
4        Cheese     11
5        Chocolate  9
6        Coffee     9
7        Donuts     10
9        Juice      11
Step 2 : Computing C2
 Given L1, we now form the candidate pairs of C2. The 7 items
in L1 form 21 pairs : d(d-1)/2 – a quadratic function, not an
exponential one.

1   {Bread, Cereal}
2   {Bread, Cheese}
3   {Bread, Chocolate}
4   {Bread, Coffee}
5   {Bread, Donuts}
6   {Bread, Juice}
7   {Cereal, Cheese}
8   {Cereal, Coffee}
9   {Cereal, Chocolate}
10  {Cereal, Donuts}
11  {Cereal, Juice}
12  {Cheese, Chocolate}
13  {Cheese, Coffee}
14  {Cheese, Donuts}
15  {Cheese, Juice}
16  {Chocolate, Coffee}
17  {Chocolate, Donuts}
18  {Chocolate, Juice}
19  {Coffee, Donuts}
20  {Coffee, Juice}
21  {Donuts, Juice}
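The quadratic count is easy to confirm: for d frequent items there are exactly d(d-1)/2 unordered pairs.

```python
from itertools import combinations

# The 7 frequent items of L1 from the example
L1 = ["Bread", "Cereal", "Cheese", "Chocolate", "Coffee", "Donuts", "Juice"]
C2 = list(combinations(L1, 2))    # every unordered pair of frequent items
d = len(L1)
print(len(C2), d * (d - 1) // 2)  # 21 21 : growth is quadratic in d
```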
From C2 to L2 based on minimum support

Candidate 2-Item Set  Freq
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Chocolate}    4
{Bread, Coffee}       8
{Bread, Donuts}       4
{Bread, Juice}        6
{Cereal, Cheese}      5
{Cereal, Coffee}      4
{Cereal, Chocolate}   5
{Cereal, Donuts}      4
{Cereal, Juice}       6
{Cheese, Chocolate}   4
{Cheese, Coffee}      9
{Cheese, Donuts}      3
{Cheese, Juice}       4
{Chocolate, Coffee}   1
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Coffee, Donuts}      1
{Coffee, Juice}       2
{Donuts, Juice}       9

With 25% support this reduces to set L2 :

Frequent 2-Item Set   Freq
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9

 This is a computationally intensive step
 L2 is not empty, so we continue
Step 2 Again : Get C3
 We combine the appropriate frequent 2-item sets from L2
(which must have the same first item) and obtain four such
itemsets, each containing three items

This is set L2 :

Frequent 2-Item Set   Freq
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9

L2 to C3 gives the candidate 3-item sets :

{Bread, Cheese, Cereal}
{Bread, Cereal, Coffee}
{Bread, Cheese, Coffee}
{Chocolate, Donuts, Juice}
Step 3 Again : C3 to L3, again based on minimum support
 Since C4 cannot be formed, L4 cannot be formed, so we
stop here

Candidate 3-item set        Frequency
{Bread, Cheese, Cereal}     4
{Bread, Cereal, Coffee}     4
{Bread, Cheese, Coffee}     8
{Chocolate, Donuts, Juice}  7

With 25% support this reduces to set L3 :

Frequent 3-item set         Frequency
{Bread, Cheese, Coffee}     8
{Chocolate, Donuts, Juice}  7
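Why does the search stop here? The two sets in L3 share no common items, so the Apriori-Gen join (which needs the first k-2 = 2 items in common) produces no 4-item candidate. A minimal check:

```python
from itertools import combinations

# The two frequent 3-item sets of L3, written as sorted tuples
L3 = [("Bread", "Cheese", "Coffee"), ("Chocolate", "Donuts", "Juice")]

# Apriori-Gen for k = 4 joins two 3-item sets that agree on their first 2 items
C4 = [tuple(sorted(set(a) | set(b)))
      for a, b in combinations(L3, 2) if a[:2] == b[:2]]
print(C4)  # [] : no join is possible, so the search stops at L3
```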
APriori : Part 2 – Find Rules
Rules will be found by looking at
 3-item sets found in L3
 2-item sets in L2 that are not subsets of any set in L3
In each case we
 Calculate confidence (A ⇒ B )
 = P (B | A) = P(A ∩ B ) / P(A)
Some shorthand
 {Bread, Cheese, Coffee } is written as { B, C, D }
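Part 2 can be sketched as follows: every non-empty proper subset of a frequent itemset is tried as an antecedent, and the rule is kept when its confidence reaches the threshold. The function name `rules_from` and the 70% cutoff (the one used in the slides that follow) are my assumptions; the counts are taken from the tables above.

```python
from itertools import combinations

def rules_from(itemset, freq, min_conf):
    """All rules A => (itemset - A) whose confidence meets min_conf.
    `freq` maps frozensets to their transaction counts."""
    out = []
    s = frozenset(itemset)
    for k in range(1, len(s)):
        for ante in combinations(sorted(s), k):
            A = frozenset(ante)
            conf = freq[s] / freq[A]   # support(XY) / support(X)
            if conf >= min_conf:
                out.append((A, s - A, conf))
    return out

# Counts from the slides: {Bread, Cheese, Coffee} appears 8 times,
# Bread 13, Cheese 11, Coffee 9, and the 2-item subsets as tabulated in L2
freq = {
    frozenset({"Bread"}): 13, frozenset({"Cheese"}): 11, frozenset({"Coffee"}): 9,
    frozenset({"Bread", "Cheese"}): 8, frozenset({"Bread", "Coffee"}): 8,
    frozenset({"Cheese", "Coffee"}): 9,
    frozenset({"Bread", "Cheese", "Coffee"}): 8,
}
kept = rules_from({"Bread", "Cheese", "Coffee"}, freq, 0.70)
print(len(kept))  # 5 of the 6 rules survive; Bread => Cheese, Coffee (8/13) drops
```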
Rules for Finding Rules !
A 3 item frequent set { BCD} results in 6 rules
 B ⇒ CD, C ⇒ BD, D ⇒ BC
 CD ⇒ B, BD ⇒ C, BC ⇒ D
Also note that
 B ⇒ CD can also be written as
 B ⇒ D, B ⇒ C
We now look at these two 3-item sets and find
their confidence levels
 { Bread, Cheese, Coffee}
 { Chocolate, Donuts, Juice }
 From the L3 set ( the highest L set ) we note that the support
counts for these rules are 8 and 7
Rules from First of 2 Itemsets in L3
One rule drops out because its confidence is below 70%
 Calculate confidence (X ⇒ Y )
 = P (Y | X) = P(X ∩ Y ) / P(X)

Confidence of association rules from { Bread B, Cheese C, Coffee D } :

Rule      Support of BCD  Frequency of LHS  Confidence
B => CD   8               13                0.615
C => BD   8               11                0.727
D => BC   8                9                0.889
CD => B   8                9                0.889
BD => C   8                8                1.000
BC => D   8                8                1.000

(The LHS frequencies come from the earlier counts : Bread 13, Cheese 11,
Coffee 9, {Cheese, Coffee} 9, {Bread, Coffee} 8, {Bread, Cheese} 8.)
Rules from Second of 2 Itemsets in L3
One rule drops out because its confidence is below 70%

Confidence of association rules from { Chocolate N, Donuts M, Juice P } :

Rule      Support of NMP  Frequency of LHS  Confidence
N => MP   7                9                0.778
M => NP   7               10                0.700
P => NM   7               11                0.636
MP => N   7                9                0.778
NP => M   7                7                1.000
NM => P   7                7                1.000

(The LHS frequencies come from the earlier counts : Chocolate 9, Donuts 10,
Juice 11, {Donuts, Juice} 9, {Chocolate, Juice} 7, {Chocolate, Donuts} 7.)
Set of 14 Rules obtained from L3

C => BD :
 C => B    1  Cheese => Bread
 C => D    2  Cheese => Coffee
D => BC :
 D => B    3  Coffee => Bread
 D => C    4  Coffee => Cheese
CD => B    5  Cheese, Coffee => Bread
BD => C    6  Bread, Coffee => Cheese
BC => D    7  Bread, Cheese => Coffee
N => MP :
 N => M    8  Chocolate => Donuts
 N => P    9  Chocolate => Juice
M => NP :
 M => P   10  Donuts => Juice
 M => N   11  Donuts => Chocolate
MP => N   12  Donuts, Juice => Chocolate
NP => M   13  Chocolate, Juice => Donuts
NM => P   14  Chocolate, Donuts => Juice
What about L2 ?
Look for sets in L2 that are not subsets of any set in L3
 { Bread, Cereal } is the only candidate
 Which gives us two more rules
 Bread ⇒ Cereal
 Cereal ⇒ Bread

Frequent 2-Item Set   Freq
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9

Frequent 3-item set         Frequency
{Bread, Cheese, Coffee}     8
{Chocolate, Donuts, Juice}  7
Which are now added to get 16 rules

C => BD :
 C => B    1  Cheese => Bread
 C => D    2  Cheese => Coffee
D => BC :
 D => B    3  Coffee => Bread
 D => C    4  Coffee => Cheese
CD => B    5  Cheese, Coffee => Bread
BD => C    6  Bread, Coffee => Cheese
BC => D    7  Bread, Cheese => Coffee
N => MP :
 N => M    8  Chocolate => Donuts
 N => P    9  Chocolate => Juice
M => NP :
 M => P   10  Donuts => Juice
 M => N   11  Donuts => Chocolate
MP => N   12  Donuts, Juice => Chocolate
NP => M   13  Chocolate, Juice => Donuts
NM => P   14  Chocolate, Donuts => Juice
          15  Bread => Cereal
          16  Cereal => Bread
So where are we ?
Apriori Algorithm
Consists of two
PARTS
 First find the frequent
itemsets
 Most of the cleverness
happens here
 We will do better than
the naive algorithm
 Find the rules
 This is relatively simpler
We have just
completed the two
PARTS
Overall approach to
ARM is as follows
 List all itemsets
 Find frequency of each
 Identify “frequent sets”
 Based on support
 Search for Rules within
“frequent sets”
 Based on confidence
Naive Algorithm
 Exponential Time
Apriori Algorithm
 Polynomial Time
Observations
Actual values of support and confidence
 25%, 75% are very high values
 In reality one works with far smaller values
“Interestingness” of a rule
 Since X, Y are related events – not independent – hence
P(X ∩ Y) ≠ P(X)P(Y)
 Interestingness ≈ P(X ∩ Y) – P(X)P(Y)
Triviality of rules
 Rules involving very frequent items can be trivial
 You always buy potatoes when you go to the market and
so you can get rules that connect potatoes to many things
Inexplicable rules
 Toothbrush was the most frequent item on Tuesday ??
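Plugging the toy numbers for Juice ⇒ Cheese (from the four-transaction example) into the interestingness measure:

```python
# Interestingness of Juice => Cheese in the 4-transaction toy data:
# the rule is interesting only to the extent that P(X ∩ Y) exceeds
# what independence, P(X)·P(Y), would predict.
p_x, p_y, p_xy = 2 / 4, 3 / 4, 2 / 4   # Juice, Cheese, and both together
interestingness = p_xy - p_x * p_y
lift = (p_xy / p_x) / p_y              # P(Y | X) / P(Y)
print(interestingness)  # 0.125 : above the independence baseline
print(lift)             # 1.33... : Cheese is a third more likely given Juice
```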
Better Algorithms
Enhancements to
the Apriori
Algorithm
 AP-TID
 Direct Hashing and
Pruning (DHP)
 Dynamic Itemset
Counting (DIC)
Frequent Pattern (FP)
Tree
 Only frequent items are
needed to find association
rules – so ignore others !
 Move the data of only
frequent items to a more
compact and efficient
structure
 A Tree structure or a directed
graph is used
 Multiple transactions with
the same (frequent) items are
stored once, with a count
Software Support
KDNuggets.com
 Excellent collections of software available
Bart Goethals
 Free software for Apriori, FP-Tree
ARMiner
 GNU Open Source software from UMass/Boston
DMII
 National University of Singapore
DB2 Intelligent Data Miner
 IBM Corporation
 Equivalent software available from other vendors as well

Data mining arm-2009-v0

  • 1.
    Data Mining Association RulesMining or Market Basket Analysis Prithwis Mukerjee, Ph.D.
  • 2.
    Prithwis Mukerjee 2 Let usdescribe the problem ... A retailer sells the following items  And we assume that the shopkeeper keeps track of what each customer purchases :  He needs to know which items are generally sold together Bread Cheese Coffee Juice Milk Tea BiscuitsSugar Newspaper Items 10 Bread, Cheese, Newspaper 20 Bread, Cheese, Juice 30 Bread, Milk 40 Cheese, Juice, Milk, Coffee 50 Sugar, Tea, Coffee, Biscuits, Newspaper 60 Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper 70 Bread, Cheese 80 Bread, Cheese, Juice, Coffee 90 Bread, Milk 100 Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper Trans ID
  • 3.
    Prithwis Mukerjee 3 Associations Rules expressingrelations between items in a “Market Basket” { Sugar and Tea } => {Biscuits}  Is it true, that if a customer buys Sugar and Tea, she will also buy biscuits ?  If so, then  These items should be ordered together  But discounts should not be given on these items at the same time ! We can make a guess but  It would be better if we could structure this problem in terms of mathematics
  • 4.
    Prithwis Mukerjee 4 Basic Concepts Setof n Items on Sale  I = { i1 , i2 , i3 , i4 , i5 , i5 , ......, in } Transaction  A subset of I : T ⊆ I  A set of items purchased in an individual transaction  With each transaction having m items  ti = { i1 , i2 , i3 , i4 , i5 , i5 , ......, im } with m < n  If we have N transactions then we have t1 , t2 ,t3 ,.. tN as unique identifier for each transaction D is our total data about all N transactions  D = {t1 , t2 ,t3 ,.. tN }
  • 5.
    Prithwis Mukerjee 5 An AssociationRule Whenever X appears, Y also appears  X ⇒ Y  X ⊆ I, Y ⊆ I, X ∩ Y = ∅ X and Y may be  Single items or  Sets of items – in which the same item does not appear X is referred to as the antecedent Y is referred to as the consequent Whether a rule like this exists is the focus of our analysis
  • 6.
    Prithwis Mukerjee 6 Two keyconcepts Support ( or prevalence)  How often does X and Y appear together in the basket ?  If this number is very low then it is not worth examining  Expressed as a fraction of the total number of transactions  Say 10% or 0.1 Confidence ( or predictability )  Of all the occurances of X, in what fraction does Y also appear ?  Expressed as a fraction of all transactions containing X  Say 80% or 0.8 We are interested in rules that have a  Minimum value of support : say 25%  Minimum value of confidence : say 75%
  • 7.
    Prithwis Mukerjee 7 Mathematically speaking... Support (X)  = (Number of times X appears ) / N  = P(X) Support (XY)  = (Number of times X and Y appears ) / N  = P(X ∩ Y) Confidence (X ⇒ Y)  = Support (XY) / Support(X)  = Probability (X ∩ Y) / P(X)  = Conditional Probability P( Y | X) Lift : an optional term  Measures the power of association  P( Y | X) / P(Y)
  • 8.
    Prithwis Mukerjee 8 The taskat hand ... Given a large set of transactions, we seek a procedure ( or algorithm )  That will discover all association rules  That have a minimum support of p%  And a minimum confidence level of q%  And to do so in an efficient manner Algorithms  The Naive or Brute Force Method  The Improved Naive algorithm  The Apriori Algorithm  Improvements to the Apriori algorithm  FP ( Frequent Pattern ) Algorithm
  • 9.
    Prithwis Mukerjee 9 Let ustry the Naive Algorithm manually ! This is the set of transaction that we have ...  We want to find Association Rules with  Minimum 50% support and  Minimum 75% confidence Items 100 Bread, Cheese 200 Bread, Cheese, Juice 300 Bread, Milk 400 Cheese, Juice, Milk Trans ID
  • 10.
    Prithwis Mukerjee 10 Itemsets &Frequencies Which sets are frequent ?  Since we are looking for a support of 50%, we need a set to appear in 2 out of 4 transactions  = (# of times X appears ) / N  = P(X)  6 sets meet this criteria Item Sets Frequency {Bread} 3 {Cheese } 3 {Juice} 2 {Milk} 2 {Bread, Cheese} 2 {Bread, Juice } 1 {Bread, Milk} 1 {Cheese, Juice} 2 {Cheese, Milk} 1 {Juice, Milk} 1 {Bread, Cheese, Juice} 1 {Bread, Cheese, Milk} 0 {Bread, Juice, Milk} 0 {Cheese, Juice, Milk} 1 {Bread, Cheese, Juice, Milk} 0
  • 11.
    Prithwis Mukerjee 11 A closerlook at the “Frequent Set” Look at itemsets with more than 1 item  {Bread, Cheese}, {Cheese, Juice}  4 rules are possible Look for confidence levels  Confidence (X ⇒ Y)  = Support (XY) / Support(X) Item Sets Frequency Rule Confidence {Bread} 3 Bread => Cheese 2 / 3 67.00% {Cheese } 3 {Juice} 2 Cheese => Bread 2 / 3 67.00% {Milk} 2 {Bread, Cheese} 2 Cheese => Juice 2 / 3 67.00% {Cheese, Juice} 2 Juice => Cheese 2 / 2 100.00%
  • 12.
    Prithwis Mukerjee 12 A closerlook at the “Frequent Set” Look at itemsets with more than 1 item  {Bread, Cheese}, {Cheese, Juice}  4 rules are possible Look for confidence levels  Confidence (X ⇒ Y)  = Support (XY) / Support(X) Item Sets Frequency Rule Confidence {Bread} 3 Bread => Cheese 2 / 3 67.00% {Cheese } 3 {Juice} 2 Cheese => Bread 2 / 3 67.00% {Milk} 2 {Bread, Cheese} 2 Cheese => Juice 2 / 3 67.00% {Cheese, Juice} 2 Juice => Cheese 2 / 2 100.00%
  • 13.
    Prithwis Mukerjee 13 The BigPicture List all itemsets  Find frequency of each Identify “frequent sets”  Based on support Search for Rules within “frequent sets”  Based on confidence
  • 14.
    Prithwis Mukerjee 14 Looking Beyondthe Retail Store Counter Terrorism  Track phone calls made or received from a particular number every day  Is an incoming call from a particular number followed by a call to another number ?  Are there any sets of numbers that are always called together ? Expand the item sets to include  Electronic fund transfers  Travel between two locations  Boarding cards  Railway reservation All data is available in electronic format
  • 15.
    Prithwis Mukerjee 15 Major Problem ExponentialGrowth of number of Itemsets  4 items : 16 = 24 members  n items : 2n members  As n becomes larger, the problem cannot be solved anymore in finite time All attempts are made to reduce the number of Item sets to be processed “Improved” Naive algorithm  Ignore sets with zero frequency Item Sets Frequency {Bread} 3 {Cheese } 3 {Juice} 2 {Milk} 2 {Bread, Cheese} 2 {Bread, Juice } 1 {Bread, Milk} 1 {Cheese, Juice} 2 {Cheese, Milk} 1 {Juice, Milk} 1 {Bread, Cheese, Juice} 1 {Bread, Cheese, Milk} 0 {Bread, Juice, Milk} 0 {Cheese, Juice, Milk} 1 {Bread, Cheese, Juice, Milk} 0
  • 16.
    Prithwis Mukerjee 16 The APrioriAlgorithm Consists of two PARTS  First find the frequent itemsets  Most of the cleverness happens here  We will do better than the naive algorithm  Find the rules  This is relatively simpler
  • 17.
    Prithwis Mukerjee 17 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p%. This is set L1 Step 2 : Apriori-Gen  Build potential sets of k items from the Lk-1 by using pairs of itemsets in Lk-1 that has the first k-2 items common and one remaining item from each member of the pair.  This is Candidate set CK Step 3 : Find Frequent Item Sets again  Scan all transactions and find frequency of sets in CK that are frequent : This gives LK  If LK is empty, stop, else go back to step 2
  • 18.
    Prithwis Mukerjee 18 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p% - This is set L1
  • 19.
    Prithwis Mukerjee 19 Example We have16 items spread over 25 transactions Item No Item Name 1 Biscuits 2 Bread 3 Cereal 4 Cheese 5 Chocolate 6 Coffee 7 8 Eggs 9 Juice 10 Milk 11 Newspaper 12 Pastry 13 Rolls 14 Sugar 15 Tea 16 Donuts Yogurt TID Items 1 2 Bread, Cereal, Cheese, Coffee 3 4 Bread, Cheese, Coffee, Cereal, Juice 5 6 Milk, Tea 7 Biscuits, Bread, Cheese, Coffee, Milk 8 Eggs, Milk, Tea 9 Bread, Cereal, Cheese, Chocolate, Coffee 10 11 Bread, Cheese, Juice 12 13 Biscuits, Bread, Cereal 14 15 Chocolate, Coffee 16 17 18 Biscuits, Bread, Cheese, Coffee 19 20 21 22 Bread, Cereal, Cheese, Coffee 23 24 Newspaper, Pastry, Rolls 25 Rolls, Sugar, Tea Biscuits, Bread, Cheese, Coffee, Yogurt Cheese, Chocolate, Donuts, Juice, Milk Bread, Cereal, Chocolate, Donuts, Juice Bread, Cereal, Chocolate, Donuts, Juice Bread, Cheese, Coffee, Donuts, Juice Cereal, Cheese, Chocolate, Donuts, Juice Donuts Donuts, Eggs, Juice Bread, Cereal, Chocolate, Donuts, Juice Cheese, Chocolate, Donuts, Juice Milk, Tea, Yogurt Chocolate, Donuts, Juice, Milk, Newspaper
  • 20.
    Prithwis Mukerjee 20 Apriori :Step 1 – Computing L1 Count frequency for each item and exclude those that are below minimum support Item No Item Name Frequency 1 Biscuits 4 2 Bread 13 3 Cereal 10 4 Cheese 11 5 Chocolate 9 6 Coffee 9 7 10 8 Eggs 2 9 Juice 11 10 Milk 6 11 Newspaper 2 12 Pastry 1 13 Rolls 2 14 Sugar 1 15 Tea 4 16 2 Donuts Yogurt Item No Item Name Frequency 2 Bread 13 3 Cereal 10 4 Cheese 11 5 Chocolate 9 6 Coffee 9 7 10 9 Juice 11 Donuts 25% support 25% support This is set L1
  • 21.
    Prithwis Mukerjee 21 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p%. This is set L1 Step 2 : Apriori-Gen  Build potential sets of k items from the Lk-1 by using pairs of itemsets in Lk-1 that has the first k-2 items common and one remaining item from each member of the pair.  This is Candidate set CK
  • 22.
    Prithwis Mukerjee 22 Step 2: Computing C2  Given L1 , we now form candidate pairs of C2 . The 7 items in form 21 pairs : d*(d-1)/2 – this is a quadratic function and not a exponential function. 1 {Bread, Cereal} 2 {Bread, Cheese} 3 {Bread, Chocolate} 4 {Bread, Coffee} 5 6 {Bread,Juice} 7 {Cereal, Cheese} 8 {Cereal, Coffee} 9 {Cereal, Chocolate} 10 11 {Cereal, Juice} 12 {Cheese, Chocolate} 13 {Cheese, Coffee} 14 15 {Cheese, Juice} 16 {Chocolate, Coffee} 17 18 {Chocolate, Juice} 19 20 {Coffee, Juice} 21 {Bread, Donuts} {Cereal, Donuts} {Cheese, Donuts} {Chocolate, Donuts} {Coffee, Donuts} {Donuts, Juice} Item No Item Name Frequency 2 Bread 13 3 Cereal 10 4 Cheese 11 5 Chocolate 9 6 Coffee 9 7 10 9 Juice 11 Donuts L1 to C2 L1 to C2
  • 23.
    Prithwis Mukerjee 23 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p%. This is set L1 Step 2 : Apriori-Gen  Build potential sets of k items from the Lk-1 by using pairs of itemsets in Lk-1 that has the first k-2 items common and one remaining item from each member of the pair.  This is Candidate set CK Step 3 : Find Frequent Item Sets again  Scan all transactions and find frequency of sets in CK that are frequent : This gives LK  If LK is empty, stop, else go back to step 2
  • 24.
    Prithwis Mukerjee 24 From C2 toL2 based on minimum support Candidate 2-Item Set Freq {Bread, Cereal} 9 {Bread, Cheese} 8 {Bread, Chocolate} 4 {Bread, Coffee} 8 4 {Bread,Juice} 6 {Cereal, Cheese} 5 {Cereal, Coffee} 4 {Cereal, Chocolate} 5 4 {Cereal, Juice} 6 {Cheese, Chocolate} 4 {Cheese, Coffee} 9 3 {Cheese, Juice} 4 {Chocolate, Coffee} 1 7 {Chocolate, Juice} 7 1 {Coffee, Juice} 2 9 {Bread, Donuts} {Cereal, Donuts} {Cheese, Donuts} {Chocolate, Donuts} {Coffee, Donuts} {Donuts, Juice} Frequent 2-Item Set Freq {Bread, Cereal} 9 {Bread, Cheese} 8 {Bread, Coffee} 8 {Cheese, Coffee} 9 7 {Chocolate, Juice} 7 9 {Chocolate, Donuts} {Donuts, Juice} 25% support 25% support  This is a computationally intensive step  L2 is not empty This is set L2
  • 25.
    Prithwis Mukerjee 25 APriori :Part 1 - Frequent Sets Step 1  Scan all transactions and find all frequent items that have support above p%. This is set L1 Step 2 : Apriori-Gen  Build potential sets of k items from the Lk-1 by using pairs of itemsets in Lk-1 that has the first k-2 items common and one remaining item from each member of the pair.  This is Candidate set CK Step 3 : Find Frequent Item Sets again  Scan all transactions and find frequency of sets in CK that are frequent : This gives LK  If LK is empty, stop, else go back to step 2
  • 26.
    Prithwis Mukerjee 26 Step 2Again : Get C3  We combine the appropriate frequent 2-item sets from L2 (which must have the same first item) and obtain four such itemsets each containing three items Frequent 2-Item Set Freq {Bread, Cereal} 9 {Bread, Cheese} 8 {Bread, Coffee} 8 {Cheese, Coffee} 9 7 {Chocolate, Juice} 7 9 {Chocolate, Donuts} {Donuts, Juice} This is set L2 Candidate 3 item set {Bread, Cheese, Cereal} {Bread, Cereal, Coffee} {Bread, Cheese, Coffee} {Chocolate, Donut, Juice} L2 to C3 L2 to C3
Prithwis Mukerjee 27 Step 3 Again: C3 to L3, Again Based on Minimum Support
Candidate 3-item set : Frequency
- {Bread, Cheese, Cereal} 4
- {Bread, Cereal, Coffee} 4
- {Bread, Cheese, Coffee} 8
- {Chocolate, Donuts, Juice} 7
At 25% support, the frequent 3-item sets (L3) are:
- {Bread, Cheese, Coffee} 8
- {Chocolate, Donuts, Juice} 7
- Since C4 cannot be formed, L4 cannot be formed, so we stop here.
Prithwis Mukerjee 29 The Apriori Algorithm Consists of Two Parts
- First find the frequent itemsets
- Most of the cleverness happens here
- We do better than the naive algorithm
- Then find the rules
- This is relatively simpler
Prithwis Mukerjee 30 Apriori: Part 2 - Find the Rules
Rules will be found by looking at
- 3-item sets found in L3
- 2-item sets in L2 that are not subsets of any set in L3
In each case we
- Calculate confidence (A ⇒ B) = P(B | A) = P(A ∩ B) / P(A)
Some shorthand
- {Bread, Cheese, Coffee} is written as {B, C, D}
Prithwis Mukerjee 31 Rules for Finding Rules!
A 3-item frequent set {BCD} yields 6 candidate rules
- B ⇒ CD, C ⇒ BD, D ⇒ BC
- CD ⇒ B, BD ⇒ C, BC ⇒ D
Also note that
- B ⇒ CD implies the simpler rules B ⇒ C and B ⇒ D
We now look at the two 3-item sets from L3 (the highest L set) and find the confidence of their rules
- {Bread, Cheese, Coffee}, with support 8
- {Chocolate, Donuts, Juice}, with support 7
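Enumerating the candidate rules and filtering them by confidence can be sketched as follows. This is an illustration, not the deck's code; `counts` is assumed to map each frequent itemset (as a frozenset) to its frequency, which is exactly what Part 1 produces.

```python
from itertools import combinations

def rules_from_itemset(itemset, counts, min_conf):
    """Enumerate every rule A => B with A, B non-empty and A ∪ B = itemset,
    keeping those whose confidence = count(itemset) / count(A) >= min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):            # size of the antecedent A
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = counts[itemset] / counts[lhs]
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, conf))
    return rules
```

For a 3-item set this enumerates the 6 candidate rules from the slide (three with a single-item antecedent, three with a two-item antecedent).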
Prithwis Mukerjee 32 Rules from First of 2 Itemsets in L3
One rule drops out because its confidence is < 70%
- Confidence (X ⇒ Y) = P(Y | X) = P(X ∩ Y) / P(X)
Confidence of association rules from {Bread, Cheese, Coffee}
(Rule : Support of BCD / Frequency of LHS = Confidence)
- B ⇒ CD : 8 / 13 = 0.615
- C ⇒ BD : 8 / 11 = 0.727
- D ⇒ BC : 8 / 9 = 0.889
- CD ⇒ B : 8 / 9 = 0.889
- BD ⇒ C : 8 / 8 = 1.000
- BC ⇒ D : 8 / 8 = 1.000
Item frequencies: 1 Biscuits 4, 2 Bread 13, 3 Cereal 10, 4 Cheese 11, 5 Chocolate 9, 6 Coffee 9, 7 Donuts 10, 8 Eggs 2, 9 Juice 11, 10 Milk 6, 11 Newspaper 2, 12 Pastry 1, 13 Rolls 2, 14 Sugar 1, 15 Tea 4, 16 Yogurt 2
Prithwis Mukerjee 33 Rules from First of 2 Itemsets in L3
One rule drops out because its confidence is < 70%
Confidence of association rules from {Bread B, Cheese C, Coffee D}
(Rule : Support of BCD / Frequency of LHS = Confidence)
- B ⇒ CD : 8 / 13 = 0.615
- C ⇒ BD : 8 / 11 = 0.727
- D ⇒ BC : 8 / 9 = 0.889
- CD ⇒ B : 8 / 9 = 0.889
- BD ⇒ C : 8 / 8 = 1.000
- BC ⇒ D : 8 / 8 = 1.000
Frequent 2-Item Set : Freq
- {Bread, Cereal} 9, {Bread, Cheese} 8, {Bread, Coffee} 8, {Cheese, Coffee} 9, {Chocolate, Donuts} 7, {Chocolate, Juice} 7, {Donuts, Juice} 9
Prithwis Mukerjee 34 Rules from Second of 2 Itemsets in L3
One rule drops out because its confidence is < 70%
Confidence of association rules from {Chocolate N, Donuts M, Juice P}
(Rule : Support of NMP / Frequency of LHS = Confidence)
- N ⇒ MP : 7 / 9 = 0.778
- M ⇒ NP : 7 / 10 = 0.700
- P ⇒ NM : 7 / 11 = 0.636
- MP ⇒ N : 7 / 9 = 0.778
- NP ⇒ M : 7 / 7 = 1.000
- NM ⇒ P : 7 / 7 = 1.000
Prithwis Mukerjee 35 Rules from Second of 2 Itemsets in L3
One rule drops out because its confidence is < 70%
Confidence of association rules from {Chocolate N, Donuts M, Juice P}
(Rule : Support of NMP / Frequency of LHS = Confidence)
- N ⇒ MP : 7 / 9 = 0.778
- M ⇒ NP : 7 / 10 = 0.700
- P ⇒ NM : 7 / 11 = 0.636
- MP ⇒ N : 7 / 9 = 0.778
- NP ⇒ M : 7 / 7 = 1.000
- NM ⇒ P : 7 / 7 = 1.000
Frequent 2-Item Set : Freq
- {Bread, Cereal} 9, {Bread, Cheese} 8, {Bread, Coffee} 8, {Cheese, Coffee} 9, {Chocolate, Donuts} 7, {Chocolate, Juice} 7, {Donuts, Juice} 9
Prithwis Mukerjee 36 Set of 14 Rules Obtained from L3
From C ⇒ BD:
- 1 Cheese ⇒ Bread (C ⇒ B)
- 2 Cheese ⇒ Coffee (C ⇒ D)
From D ⇒ BC:
- 3 Coffee ⇒ Bread (D ⇒ B)
- 4 Coffee ⇒ Cheese (D ⇒ C)
- 5 Cheese, Coffee ⇒ Bread (CD ⇒ B)
- 6 Bread, Coffee ⇒ Cheese (BD ⇒ C)
- 7 Bread, Cheese ⇒ Coffee (BC ⇒ D)
From N ⇒ MP:
- 8 Chocolate ⇒ Donuts (N ⇒ M)
- 9 Chocolate ⇒ Juice (N ⇒ P)
From M ⇒ NP:
- 10 Donuts ⇒ Juice (M ⇒ P)
- 11 Donuts ⇒ Chocolate (M ⇒ N)
- 12 Donuts, Juice ⇒ Chocolate (MP ⇒ N)
- 13 Chocolate, Juice ⇒ Donuts (NP ⇒ M)
- 14 Chocolate, Donuts ⇒ Juice (NM ⇒ P)
Prithwis Mukerjee 37 What about L2?
Look for sets in L2 that are not subsets of any set in L3
- {Bread, Cereal} is the only candidate
- Which gives us two more rules
- Bread ⇒ Cereal (note: its confidence is 9/13 ≈ 0.69, marginally below the 70% cut-off)
- Cereal ⇒ Bread
Frequent 2-Item Set : Freq
- {Bread, Cereal} 9, {Bread, Cheese} 8, {Bread, Coffee} 8, {Cheese, Coffee} 9, {Chocolate, Donuts} 7, {Chocolate, Juice} 7, {Donuts, Juice} 9
Frequent 3-item set : Frequency
- {Bread, Cheese, Coffee} 8
- {Chocolate, Donuts, Juice} 7
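Selecting the L2 sets that are not covered by any L3 set is a one-line filter. A small sketch (illustrative, not the deck's code):

```python
def uncovered_2_sets(L2, L3):
    """Return the 2-item sets in L2 that are not subsets of any
    frequent 3-item set in L3; only these yield new rules."""
    return [s for s in L2 if not any(set(s) <= set(t) for t in L3)]
```

With the L2 and L3 sets from the slides, only {Bread, Cereal} survives, as stated above.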
Prithwis Mukerjee 38 Which Are Now Added to Get 16 Rules
From C ⇒ BD:
- 1 Cheese ⇒ Bread (C ⇒ B)
- 2 Cheese ⇒ Coffee (C ⇒ D)
From D ⇒ BC:
- 3 Coffee ⇒ Bread (D ⇒ B)
- 4 Coffee ⇒ Cheese (D ⇒ C)
- 5 Cheese, Coffee ⇒ Bread (CD ⇒ B)
- 6 Bread, Coffee ⇒ Cheese (BD ⇒ C)
- 7 Bread, Cheese ⇒ Coffee (BC ⇒ D)
From N ⇒ MP:
- 8 Chocolate ⇒ Donuts (N ⇒ M)
- 9 Chocolate ⇒ Juice (N ⇒ P)
From M ⇒ NP:
- 10 Donuts ⇒ Juice (M ⇒ P)
- 11 Donuts ⇒ Chocolate (M ⇒ N)
- 12 Donuts, Juice ⇒ Chocolate (MP ⇒ N)
- 13 Chocolate, Juice ⇒ Donuts (NP ⇒ M)
- 14 Chocolate, Donuts ⇒ Juice (NM ⇒ P)
From L2:
- 15 Bread ⇒ Cereal
- 16 Cereal ⇒ Bread
Prithwis Mukerjee 39 So Where Are We?
The Apriori algorithm consists of two parts
- First find the frequent itemsets
- Most of the cleverness happens here
- We do better than the naive algorithm
- Then find the rules
- This is relatively simpler
We have just completed the two parts
The overall approach to ARM is as follows
- List all itemsets
- Find the frequency of each
- Identify the frequent sets, based on support
- Search for rules within the frequent sets, based on confidence
Naive algorithm: exponential time
Apriori algorithm: polynomial time
Prithwis Mukerjee 40 Observations
Actual values of support and confidence
- 25% support and 70% confidence are very high values
- In reality one works with far smaller values
"Interestingness" of a rule
- Since X and Y are related events, not independent, P(X ∩ Y) ≠ P(X)P(Y)
- Interestingness ≈ P(X ∩ Y) - P(X)P(Y)
Triviality of rules
- Rules involving very frequent items can be trivial
- If you always buy potatoes when you go to the market, you will get rules that connect potatoes to many other things
Inexplicable rules
- Toothbrush was the most frequent item on Tuesday??
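The interestingness measure above (sometimes called leverage) is straightforward to compute from the itemset counts. A minimal sketch; note that the total number of transactions used in the test is an assumption, since the slides give item counts but never state the total explicitly:

```python
def interestingness(freq_xy, freq_x, freq_y, n):
    """P(X ∩ Y) - P(X)·P(Y): zero when X and Y are independent,
    large when they co-occur more often than chance predicts."""
    return freq_xy / n - (freq_x / n) * (freq_y / n)
```

A rule over very frequent items can have high confidence yet an interestingness score near zero, which is exactly the triviality problem described above.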
Prithwis Mukerjee 41 Better Algorithms
Enhancements to the Apriori algorithm
- AprioriTid (AP-TID)
- Direct Hashing and Pruning (DHP)
- Dynamic Itemset Counting (DIC)
Frequent Pattern (FP) Tree
- Only frequent items are needed to find association rules, so ignore the others!
- Move the data for only the frequent items into a more compact and efficient structure
- A tree structure or a directed graph is used
- Multiple transactions with the same (frequent) items are stored once, with a count
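The compaction idea behind the FP-tree, storing each distinct frequent-item projection only once with a count, can be illustrated without building an actual tree. This sketch uses a flat table keyed by the projection, which captures the storage saving but not the tree's shared-prefix structure:

```python
from collections import Counter

def compact(transactions, frequent_items):
    """Drop infrequent items and store each distinct projection once,
    with a count -- a flat-table illustration of FP-tree compaction."""
    table = Counter()
    for t in transactions:
        key = tuple(sorted(i for i in t if i in frequent_items))
        if key:  # skip transactions with no frequent items at all
            table[key] += 1
    return table
```

Identical baskets (after dropping infrequent items) collapse into a single entry, so repeated purchase patterns cost no extra storage.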
Prithwis Mukerjee 42 Software Support
KDNuggets.com
- Excellent collections of software available
Bart Goethals
- Free software for Apriori and FP-Tree
ARMiner
- GNU open-source software from UMass/Boston
DMII
- National University of Singapore
DB2 Intelligent Data Miner
- IBM Corporation
- Equivalent software is available from other vendors as well