# Data mining arm-2009-v0

Published in: Technology, Self Improvement

### Data mining arm-2009-v0

1. Data Mining: Association Rules Mining or Market Basket Analysis. Prithwis Mukerjee, Ph.D.
2. Let us describe the problem ...
    - A retailer sells the following items: Bread, Cheese, Coffee, Juice, Milk, Tea, Biscuits, Sugar, Newspaper
    - We assume that the shopkeeper keeps track of what each customer purchases:

    | Trans ID | Items |
    |---|---|
    | 10 | Bread, Cheese, Newspaper |
    | 20 | Bread, Cheese, Juice |
    | 30 | Bread, Milk |
    | 40 | Cheese, Juice, Milk, Coffee |
    | 50 | Sugar, Tea, Coffee, Biscuits, Newspaper |
    | 60 | Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper |
    | 70 | Bread, Cheese |
    | 80 | Bread, Cheese, Juice, Coffee |
    | 90 | Bread, Milk |
    | 100 | Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper |

    - He needs to know which items are generally sold together
3. Associations
    - Rules expressing relations between items in a “Market Basket”
    - { Sugar and Tea } => { Biscuits }
        - Is it true that if a customer buys Sugar and Tea, she will also buy Biscuits?
        - If so, then
            - These items should be ordered together
            - But discounts should not be given on these items at the same time!
    - We can make a guess, but
        - It would be better if we could structure this problem in terms of mathematics
4. Basic Concepts
    - Set of n items on sale
        - I = { i1, i2, i3, i4, i5, ......, in }
    - Transaction
        - A subset of I: T ⊆ I
        - A set of items purchased in an individual transaction
        - With each transaction having m items
        - ti = { i1, i2, i3, i4, i5, ......, im } with m ≤ n
        - If we have N transactions, then t1, t2, t3, ..., tN serve as unique identifiers for the transactions
    - D is our total data about all N transactions
        - D = { t1, t2, t3, ..., tN }
5. An Association Rule
    - Whenever X appears, Y also appears
        - X ⇒ Y
        - X ⊆ I, Y ⊆ I, X ∩ Y = ∅
    - X and Y may be
        - Single items, or
        - Sets of items, with no item appearing in both X and Y
    - X is referred to as the antecedent
    - Y is referred to as the consequent
    - Whether a rule like this exists is the focus of our analysis
6. Two key concepts
    - Support (or prevalence)
        - How often do X and Y appear together in the basket?
        - If this number is very low, then the rule is not worth examining
        - Expressed as a fraction of the total number of transactions
        - Say 10% or 0.1
    - Confidence (or predictability)
        - Of all the occurrences of X, in what fraction does Y also appear?
        - Expressed as a fraction of all transactions containing X
        - Say 80% or 0.8
    - We are interested in rules that have a
        - Minimum value of support: say 25%
        - Minimum value of confidence: say 75%
7. Mathematically speaking ...
    - Support (X)
        - = (Number of transactions in which X appears) / N
        - = P(X)
    - Support (XY)
        - = (Number of transactions in which X and Y appear together) / N
        - = P(X ∩ Y)
    - Confidence (X ⇒ Y)
        - = Support (XY) / Support (X)
        - = P(X ∩ Y) / P(X)
        - = Conditional probability P(Y | X)
    - Lift: an optional term
        - Measures the power of association
        - = P(Y | X) / P(Y)
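
The three measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `support`, `confidence` and `lift` are hypothetical, and the data is the ten-transaction retailer example from slide 2. Exact fractions are used so that the results can be checked by hand. For the rule { Sugar, Tea } => { Biscuits }, the support of the antecedent is 3/10, the confidence is 2/3 and the lift is 10/3. A rule whose antecedent never occurs would divide by zero here; the sketch does not guard against that.

```python
from fractions import Fraction

# Transactions from the retailer example on slide 2 (Trans ID -> items).
TRANSACTIONS = {
    10: {"Bread", "Cheese", "Newspaper"},
    20: {"Bread", "Cheese", "Juice"},
    30: {"Bread", "Milk"},
    40: {"Cheese", "Juice", "Milk", "Coffee"},
    50: {"Sugar", "Tea", "Coffee", "Biscuits", "Newspaper"},
    60: {"Sugar", "Tea", "Coffee", "Biscuits", "Milk", "Juice", "Newspaper"},
    70: {"Bread", "Cheese"},
    80: {"Bread", "Cheese", "Juice", "Coffee"},
    90: {"Bread", "Milk"},
    100: {"Sugar", "Tea", "Coffee", "Bread", "Milk", "Juice", "Newspaper"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))


def confidence(x, y, transactions=TRANSACTIONS):
    """Support(XY) / Support(X), i.e. the conditional probability P(Y | X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions=TRANSACTIONS):
    """P(Y | X) / P(Y); values above 1 suggest a positive association."""
    return confidence(x, y, transactions) / support(y, transactions)


if __name__ == "__main__":
    x, y = {"Sugar", "Tea"}, {"Biscuits"}
    print(support(x), confidence(x, y), lift(x, y))
```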
8. The task at hand ...
    - Given a large set of transactions, we seek a procedure (or algorithm)
        - That will discover all association rules
        - That have a minimum support of p%
        - And a minimum confidence level of q%
        - And to do so in an efficient manner
    - Algorithms
        - The Naive or Brute Force Method
        - The Improved Naive algorithm
        - The Apriori Algorithm
        - Improvements to the Apriori algorithm
        - FP (Frequent Pattern) Algorithm
9. Let us try the Naive Algorithm manually!
    - This is the set of transactions that we have ...

    | Trans ID | Items |
    |---|---|
    | 100 | Bread, Cheese |
    | 200 | Bread, Cheese, Juice |
    | 300 | Bread, Milk |
    | 400 | Cheese, Juice, Milk |

    - We want to find Association Rules with
        - Minimum 50% support and
        - Minimum 75% confidence
10. Itemsets & Frequencies

    | Item Set | Frequency |
    |---|---|
    | {Bread} | 3 |
    | {Cheese} | 3 |
    | {Juice} | 2 |
    | {Milk} | 2 |
    | {Bread, Cheese} | 2 |
    | {Bread, Juice} | 1 |
    | {Bread, Milk} | 1 |
    | {Cheese, Juice} | 2 |
    | {Cheese, Milk} | 1 |
    | {Juice, Milk} | 1 |
    | {Bread, Cheese, Juice} | 1 |
    | {Bread, Cheese, Milk} | 0 |
    | {Bread, Juice, Milk} | 0 |
    | {Cheese, Juice, Milk} | 1 |
    | {Bread, Cheese, Juice, Milk} | 0 |

    - Which sets are frequent?
        - Since we are looking for a support of 50%, we need a set to appear in at least 2 out of 4 transactions
        - Support (X) = (# of times X appears) / N = P(X)
        - 6 sets meet this criterion
11. A closer look at the “Frequent Set”

    | Frequent Item Set | Frequency |
    |---|---|
    | {Bread} | 3 |
    | {Cheese} | 3 |
    | {Juice} | 2 |
    | {Milk} | 2 |
    | {Bread, Cheese} | 2 |
    | {Cheese, Juice} | 2 |

    - Look at itemsets with more than 1 item
        - {Bread, Cheese}, {Cheese, Juice}
        - 4 rules are possible
    - Look for confidence levels
        - Confidence (X ⇒ Y) = Support (XY) / Support (X)

    | Rule | Support (XY) / Support (X) | Confidence |
    |---|---|---|
    | Bread => Cheese | 2 / 3 | 67% |
    | Cheese => Bread | 2 / 3 | 67% |
    | Cheese => Juice | 2 / 3 | 67% |
    | Juice => Cheese | 2 / 2 | 100% |

    - With a minimum confidence of 75%, only Juice => Cheese qualifies
13. The Big Picture
    - List all itemsets
        - Find frequency of each
    - Identify “frequent sets”
        - Based on support
    - Search for Rules within “frequent sets”
        - Based on confidence
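
The three stages above can be sketched as a brute-force Python function. This is an illustration under stated assumptions rather than the author's code: the name `naive_rules` is hypothetical, and the data is the four-transaction example from slide 9 with 50% minimum support and 75% minimum confidence. It enumerates every non-empty itemset, so it is only usable for a handful of items. On this data it finds the 6 frequent sets and the single qualifying rule Juice => Cheese.

```python
from itertools import combinations

# The four transactions from slide 9.
TRANSACTIONS = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]


def naive_rules(transactions, min_support, min_confidence):
    items = sorted(set().union(*transactions))
    n = len(transactions)

    # Stage 1: list every non-empty itemset and count its frequency.
    freq = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            freq[itemset] = sum(1 for t in transactions if itemset <= t)

    # Stage 2: keep the itemsets that meet the minimum support.
    frequent = {s: c for s, c in freq.items() if c / n >= min_support}

    # Stage 3: split each frequent itemset into antecedent => consequent
    # and keep the rules that meet the minimum confidence.
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = count / freq[lhs]
                if conf >= min_confidence:
                    rules.append((sorted(lhs), sorted(itemset - lhs), conf))
    return frequent, rules


if __name__ == "__main__":
    frequent, rules = naive_rules(TRANSACTIONS, 0.5, 0.75)
    print(len(frequent), rules)
```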
14. Looking Beyond the Retail Store
    - Counter Terrorism
        - Track phone calls made or received from a particular number every day
        - Is an incoming call from a particular number followed by a call to another number?
        - Are there any sets of numbers that are always called together?
    - Expand the item sets to include
        - Electronic fund transfers
        - Travel between two locations
        - Boarding cards
        - Railway reservation
    - All data is available in electronic format
15. Major Problem
    - Exponential growth of the number of itemsets
        - 4 items: 16 = 2^4 members (counting the empty set)
        - n items: 2^n members
        - As n becomes larger, the problem can no longer be solved in any reasonable time
    - All attempts are made to reduce the number of item sets to be processed
    - “Improved” Naive algorithm
        - Ignore sets with zero frequency
        - In the four-transaction example, 3 of the 15 non-empty sets have zero frequency: {Bread, Cheese, Milk}, {Bread, Juice, Milk} and {Bread, Cheese, Juice, Milk}
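
One way to read "ignore sets with zero frequency" is to count only the itemsets that actually occur inside some transaction, instead of enumerating all 2^n possibilities first. The sketch below takes that reading; it is an assumption about what the slide intends, and the name `count_occurring_itemsets` is hypothetical. On the four-transaction example it produces 12 itemsets, which are the 15 non-empty sets minus the 3 with zero frequency. The work is still exponential in the length of the longest transaction.

```python
from collections import Counter
from itertools import combinations

# The four transactions from slide 9.
TRANSACTIONS = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]


def count_occurring_itemsets(transactions):
    """Count every itemset that appears in at least one transaction."""
    counts = Counter()
    for items in transactions:
        ordered = sorted(items)  # fixed order so each itemset has one key
        for size in range(1, len(ordered) + 1):
            for combo in combinations(ordered, size):
                counts[combo] += 1
    return counts


if __name__ == "__main__":
    counts = count_occurring_itemsets(TRANSACTIONS)
    print(len(counts), counts[("Bread", "Cheese")])
```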
16. The Apriori Algorithm
    - Consists of two PARTS
        - First find the frequent itemsets
            - Most of the cleverness happens here
            - We will do better than the naive algorithm
        - Find the rules
            - This is relatively simpler
17. Apriori: Part 1 - Frequent Sets
    - Step 1
        - Scan all transactions and find all frequent items, that is, items with support of at least p%. This is set L1
    - Step 2: Apriori-Gen
        - Build potential sets of k items from L(k-1) by using pairs of itemsets in L(k-1) that have their first k-2 items in common, plus the one remaining item from each member of the pair
        - This is candidate set Ck
    - Step 3: Find frequent item sets again
        - Scan all transactions and find the frequency of each set in Ck; the sets that are frequent give Lk
        - If Lk is empty, stop; else go back to Step 2
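
The three steps above can be sketched as a short loop. This is a minimal illustration, not a tuned implementation: the names `apriori_gen` and `frequent_itemsets` are hypothetical, itemsets are kept as alphabetically sorted tuples so that "the first k-2 items" is well defined, and the data is the four-transaction example from slide 9 with 50% minimum support. On that data the loop stops after L2, because the two frequent pairs do not share a first item and C3 is empty.

```python
from itertools import combinations

# The four transactions from slide 9.
TRANSACTIONS = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]


def apriori_gen(prev_frequent, k):
    """Step 2: join pairs of sorted (k-1)-tuples sharing their first k-2 items."""
    candidates = set()
    for a, b in combinations(sorted(prev_frequent), 2):
        if a[:k - 2] == b[:k - 2]:
            candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates


def frequent_itemsets(transactions, min_support):
    n = len(transactions)

    def count(candidate):
        return sum(1 for t in transactions if set(candidate) <= t)

    def keep_frequent(candidates):
        counted = {c: count(c) for c in candidates}
        return {c: f for c, f in counted.items() if f / n >= min_support}

    # Step 1: frequent single items give L1.
    items = sorted(set().union(*transactions))
    level = keep_frequent((item,) for item in items)

    result = {}
    k = 2
    while level:  # stop as soon as some Lk is empty
        result.update(level)
        level = keep_frequent(apriori_gen(level, k))  # steps 2 and 3
        k += 1
    return result


if __name__ == "__main__":
    for itemset, freq in sorted(frequent_itemsets(TRANSACTIONS, 0.5).items()):
        print(itemset, freq)
```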
20. Apriori: Step 1 - Computing L1
    - Count the frequency of each item and exclude those that are below the minimum support (25%)

    | Item No | Item Name | Frequency |
    |---|---|---|
    | 1 | Biscuits | 4 |
    | 2 | Bread | 13 |
    | 3 | Cereal | 10 |
    | 4 | Cheese | 11 |
    | 5 | Chocolate | 9 |
    | 6 | Coffee | 9 |
    | 7 | Donuts | 10 |
    | 8 | Eggs | 2 |
    | 9 | Juice | 11 |
    | 10 | Milk | 6 |
    | 11 | Newspaper | 2 |
    | 12 | Pastry | 1 |
    | 13 | Rolls | 2 |
    | 14 | Sugar | 1 |
    | 15 | Tea | 4 |
    | 16 | Yogurt | 2 |

    - Items that meet 25% support:

    | Item No | Item Name | Frequency |
    |---|---|---|
    | 2 | Bread | 13 |
    | 3 | Cereal | 10 |
    | 4 | Cheese | 11 |
    | 5 | Chocolate | 9 |
    | 6 | Coffee | 9 |
    | 7 | Donuts | 10 |
    | 9 | Juice | 11 |

    - This is set L1
22. Step 2: Computing C2
    - Given L1, we now form the candidate pairs of C2. The 7 items in L1 form 21 pairs: d*(d-1)/2. This is a quadratic function and not an exponential function.

    | No | Candidate pair | No | Candidate pair | No | Candidate pair |
    |---|---|---|---|---|---|
    | 1 | {Bread, Cereal} | 8 | {Cereal, Coffee} | 15 | {Cheese, Juice} |
    | 2 | {Bread, Cheese} | 9 | {Cereal, Chocolate} | 16 | {Chocolate, Coffee} |
    | 3 | {Bread, Chocolate} | 10 | {Cereal, Donuts} | 17 | {Chocolate, Donuts} |
    | 4 | {Bread, Coffee} | 11 | {Cereal, Juice} | 18 | {Chocolate, Juice} |
    | 5 | {Bread, Donuts} | 12 | {Cheese, Chocolate} | 19 | {Coffee, Donuts} |
    | 6 | {Bread, Juice} | 13 | {Cheese, Coffee} | 20 | {Coffee, Juice} |
    | 7 | {Cereal, Cheese} | 14 | {Cheese, Donuts} | 21 | {Donuts, Juice} |

    - L1 to C2: the pairs are built from the seven frequent items Bread (13), Cereal (10), Cheese (11), Chocolate (9), Coffee (9), Donuts (10) and Juice (11)
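
The pair count can be checked with the standard library. The sketch below builds C2 from the seven items of L1 with `itertools.combinations` and compares the number of pairs, d*(d-1)/2, with the 2^d subsets that a brute-force search over the same items would have to consider. The variable names are illustrative only.

```python
from itertools import combinations

# The seven frequent items in L1 (slide 20).
L1 = ["Bread", "Cereal", "Cheese", "Chocolate", "Coffee", "Donuts", "Juice"]

# Every unordered pair of frequent items is a candidate 2-item set.
C2 = list(combinations(L1, 2))

d = len(L1)
pair_count = d * (d - 1) // 2  # quadratic in d
subset_count = 2 ** d          # exponential in d

if __name__ == "__main__":
    print(len(C2), pair_count, subset_count)
```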
24. From C2 to L2 based on minimum support (25%)

    | Candidate 2-Item Set | Freq |
    |---|---|
    | {Bread, Cereal} | 9 |
    | {Bread, Cheese} | 8 |
    | {Bread, Chocolate} | 4 |
    | {Bread, Coffee} | 8 |
    | {Bread, Donuts} | 4 |
    | {Bread, Juice} | 6 |
    | {Cereal, Cheese} | 5 |
    | {Cereal, Coffee} | 4 |
    | {Cereal, Chocolate} | 5 |
    | {Cereal, Donuts} | 4 |
    | {Cereal, Juice} | 6 |
    | {Cheese, Chocolate} | 4 |
    | {Cheese, Coffee} | 9 |
    | {Cheese, Donuts} | 3 |
    | {Cheese, Juice} | 4 |
    | {Chocolate, Coffee} | 1 |
    | {Chocolate, Donuts} | 7 |
    | {Chocolate, Juice} | 7 |
    | {Coffee, Donuts} | 1 |
    | {Coffee, Juice} | 2 |
    | {Donuts, Juice} | 9 |

    | Frequent 2-Item Set | Freq |
    |---|---|
    | {Bread, Cereal} | 9 |
    | {Bread, Cheese} | 8 |
    | {Bread, Coffee} | 8 |
    | {Cheese, Coffee} | 9 |
    | {Chocolate, Donuts} | 7 |
    | {Chocolate, Juice} | 7 |
    | {Donuts, Juice} | 9 |

    - This is a computationally intensive step
    - L2 is not empty
    - This is set L2
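
The filtering step can be sketched as a dictionary comprehension over the candidate frequencies listed on this slide. One assumption is made loudly here: the transcript does not state the total number of transactions, so the 25% threshold is expressed as a minimum count of 7, which is inferred from the table (pairs with frequency 6 are dropped, pairs with frequency 7 are kept). The names `C2_FREQ` and `MIN_COUNT` are illustrative.

```python
# Frequencies of the 21 candidate pairs in C2 (slide 24).
C2_FREQ = {
    ("Bread", "Cereal"): 9, ("Bread", "Cheese"): 8, ("Bread", "Chocolate"): 4,
    ("Bread", "Coffee"): 8, ("Bread", "Donuts"): 4, ("Bread", "Juice"): 6,
    ("Cereal", "Cheese"): 5, ("Cereal", "Coffee"): 4, ("Cereal", "Chocolate"): 5,
    ("Cereal", "Donuts"): 4, ("Cereal", "Juice"): 6, ("Cheese", "Chocolate"): 4,
    ("Cheese", "Coffee"): 9, ("Cheese", "Donuts"): 3, ("Cheese", "Juice"): 4,
    ("Chocolate", "Coffee"): 1, ("Chocolate", "Donuts"): 7,
    ("Chocolate", "Juice"): 7, ("Coffee", "Donuts"): 1, ("Coffee", "Juice"): 2,
    ("Donuts", "Juice"): 9,
}

# Assumption: 25% support corresponds to a count of at least 7,
# inferred from which pairs the slide keeps and which it drops.
MIN_COUNT = 7

# L2 is the subset of C2 that meets the minimum support.
L2 = {pair: freq for pair, freq in C2_FREQ.items() if freq >= MIN_COUNT}

if __name__ == "__main__":
    for pair, freq in L2.items():
        print(pair, freq)
```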
26. Step 2 Again: Get C3

    | Candidate 3-Item Set |
    |---|
    | {Bread, Cheese, Cereal} |
    | {Bread, Cereal, Coffee} |
    | {Bread, Cheese, Coffee} |
    | {Chocolate, Donuts, Juice} |

    - L2 to C3: we combine the appropriate frequent 2-item sets from L2 (which must have the same first item) and obtain four such itemsets, each containing three items
    - L2 is the set of seven frequent 2-item sets found on slide 24
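
The join from L2 to C3 can be sketched as follows, with each pair stored as an alphabetically sorted tuple. The `join` function reproduces the four candidates on this slide. The `prune` function is an addition that does not appear in the slides: it is the usual Apriori refinement that discards a candidate if any of its (k-1)-item subsets is not frequent. Applied here, it removes the two candidates containing {Cereal, Cheese} or {Cereal, Coffee} before any transaction is scanned. Both function names are hypothetical.

```python
from itertools import combinations

# The seven frequent pairs in L2 (slide 24), as sorted tuples.
L2 = [
    ("Bread", "Cereal"), ("Bread", "Cheese"), ("Bread", "Coffee"),
    ("Cheese", "Coffee"), ("Chocolate", "Donuts"), ("Chocolate", "Juice"),
    ("Donuts", "Juice"),
]


def join(prev_frequent, k):
    """Merge pairs of (k-1)-item sets that share their first k-2 items."""
    candidates = set()
    for a, b in combinations(sorted(prev_frequent), 2):
        if a[:k - 2] == b[:k - 2]:
            candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates


def prune(candidates, prev_frequent):
    """Not in the slides: drop candidates with an infrequent (k-1)-subset."""
    prev = set(prev_frequent)
    return {
        c for c in candidates
        if all(sub in prev for sub in combinations(c, len(c) - 1))
    }


if __name__ == "__main__":
    C3 = join(L2, 3)
    print(sorted(C3))
    print(sorted(prune(C3, L2)))
```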
27. Step 3 Again: C3 to L3, Again Based on Minimum Support (25%)

    | Candidate 3-Item Set | Frequency |
    |---|---|
    | {Bread, Cheese, Cereal} | 4 |
    | {Bread, Cereal, Coffee} | 4 |
    | {Bread, Cheese, Coffee} | 8 |
    | {Chocolate, Donuts, Juice} | 7 |

    | Frequent 3-Item Set | Frequency |
    |---|---|
    | {Bread, Cheese, Coffee} | 8 |
    | {Chocolate, Donuts, Juice} | 7 |

    - Since C4 cannot be formed, L4 cannot be formed, so we stop here
30. Apriori: Part 2 - Find Rules
    - Rules will be found by looking at
        - 3-item sets found in L3
        - 2-item sets in L2 that are not subsets of any set in L3
    - In each case we
        - Calculate confidence (A ⇒ B)
        - = P(B | A) = P(A ∩ B) / P(A)
    - Some shorthand
        - {Bread, Cheese, Coffee} is written as {B, C, D}
31. Rules for Finding Rules!
    - A 3-item frequent set {BCD} results in 6 rules
        - B ⇒ CD, C ⇒ BD, D ⇒ BC
        - CD ⇒ B, BD ⇒ C, BC ⇒ D
    - Also note that
        - B ⇒ CD can also be written as
        - B ⇒ D, B ⇒ C
    - We now look at these two 3-item sets and find their confidence levels
        - {Bread, Cheese, Coffee}
        - {Chocolate, Donuts, Juice}
        - Both come from the L3 set (the highest L set); note that the support counts for their rules are 8 and 7 respectively
32. Rules from First of 2 Itemsets in L3
    - Calculate confidence (X ⇒ Y) = P(Y | X) = P(X ∩ Y) / P(X)
    - Confidence of association rules from {Bread B, Cheese C, Coffee D}

    | Rule | Support of BCD | Frequency of LHS | Confidence |
    |---|---|---|---|
    | B => CD | 8 | 13 | 0.615 |
    | C => BD | 8 | 11 | 0.727 |
    | D => BC | 8 | 9 | 0.889 |
    | CD => B | 8 | 9 | 0.889 |
    | BD => C | 8 | 8 | 1.000 |
    | BC => D | 8 | 8 | 1.000 |

    - The frequency of a single-item LHS is read from the item frequency table: Bread 13, Cheese 11, Coffee 9
    - One rule (B => CD) drops out because confidence < 70%
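
The confidence table above can be reproduced from the frequencies already given in the slides. The sketch below is illustrative: `rules_from` is a hypothetical helper, and `FREQ` holds only the counts needed for {Bread, Cheese, Coffee}, taken from the item table (slide 20), L2 (slide 24) and L3 (slide 27). Because confidence is a ratio of two counts, the unknown total number of transactions cancels out. With a 70% threshold, five of the six rules survive and Bread => Cheese, Coffee (8/13) is dropped.

```python
from itertools import combinations

# Frequencies needed for the rules from {Bread, Cheese, Coffee}.
FREQ = {
    frozenset({"Bread"}): 13,
    frozenset({"Cheese"}): 11,
    frozenset({"Coffee"}): 9,
    frozenset({"Bread", "Cheese"}): 8,
    frozenset({"Bread", "Coffee"}): 8,
    frozenset({"Cheese", "Coffee"}): 9,
    frozenset({"Bread", "Cheese", "Coffee"}): 8,
}


def rules_from(itemset, freq, min_confidence):
    """Return (lhs, rhs, confidence) for every rule that meets the threshold."""
    itemset = frozenset(itemset)
    kept = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            # Confidence = frequency of the whole set / frequency of the LHS.
            conf = freq[itemset] / freq[lhs]
            if conf >= min_confidence:
                rhs = itemset - lhs
                kept.append((tuple(sorted(lhs)), tuple(sorted(rhs)), round(conf, 3)))
    return kept


if __name__ == "__main__":
    for lhs, rhs, conf in rules_from({"Bread", "Cheese", "Coffee"}, FREQ, 0.70):
        print(lhs, "=>", rhs, conf)
```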
33. Rules from First of 2 Itemsets in L3 (continued)
    - Same rule table as on the previous slide, for {Bread B, Cheese C, Coffee D}; one rule drops out because confidence < 70%
    - The frequency of a two-item LHS is read from L2: {Cheese, Coffee} 9, {Bread, Coffee} 8, {Bread, Cheese} 8
34. Rules from Second of 2 Itemsets in L3
    - Confidence of association rules from {Chocolate N, Donuts M, Juice P}

    | Rule | Support of NMP | Frequency of LHS | Confidence |
    |---|---|---|---|
    | N => MP | 7 | 9 | 0.778 |
    | M => NP | 7 | 10 | 0.700 |
    | P => NM | 7 | 11 | 0.636 |
    | MP => N | 7 | 9 | 0.778 |
    | NP => M | 7 | 7 | 1.000 |
    | NM => P | 7 | 7 | 1.000 |

    - The frequency of a single-item LHS is read from the item frequency table: Chocolate 9, Donuts 10, Juice 11
    - One rule (P => NM) drops out because confidence < 70%
35. Rules from Second of 2 Itemsets in L3 (continued)
    - Same rule table as on the previous slide, for {Chocolate N, Donuts M, Juice P}; one rule drops out because confidence < 70%
    - The frequency of a two-item LHS is read from L2: {Donuts, Juice} 9, {Chocolate, Juice} 7, {Chocolate, Donuts} 7
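36. Set of 14 Rules obtained from L3

    | Rule from L3 | Written as | No | In words |
    |---|---|---|---|
    | C => BD | C => B | 1 | Cheese => Bread |
    | C => BD | C => D | 2 | Cheese => Coffee |
    | D => BC | D => B | 3 | Coffee => Bread |
    | D => BC | D => C | 4 | Coffee => Cheese |
    | CD => B | | 5 | Cheese, Coffee => Bread |
    | BD => C | | 6 | Bread, Coffee => Cheese |
    | BC => D | | 7 | Bread, Cheese => Coffee |
    | N => MP | N => M | 8 | Chocolate => Donuts |
    | N => MP | N => P | 9 | Chocolate => Juice |
    | M => NP | M => P | 10 | Donuts => Juice |
    | M => NP | M => N | 11 | Donuts => Chocolate |
    | MP => N | | 12 | Donuts, Juice => Chocolate |
    | NP => M | | 13 | Chocolate, Juice => Donuts |
    | NM => P | | 14 | Chocolate, Donuts => Juice |

    - Rules 1 to 7 come from {Bread, Cheese, Coffee}; rules 8 to 14 come from {Chocolate, Donuts, Juice}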
37. What about L2?
    - Look for sets in L2 that are not subsets of any set in L3 (L2 and L3 as found on slides 24 and 27)
        - {Bread, Cereal} is the only candidate
        - Which gives us two more rules
            - Bread ⇒ Cereal
            - Cereal ⇒ Bread
38. Which are now added to get 16 rules
    - Rules 1 to 14: the rules obtained from L3, as listed on slide 36
    - Rule 15: Bread => Cereal
    - Rule 16: Cereal => Bread
39. So where are we?
    - Apriori Algorithm consists of two PARTS
        - First find the frequent itemsets
            - Most of the cleverness happens here
            - We will do better than the naive algorithm
        - Find the rules
            - This is relatively simpler
    - We have just completed the two PARTS
    - Overall approach to ARM is as follows
        - List all itemsets
            - Find frequency of each
        - Identify “frequent sets”
            - Based on support
        - Search for Rules within “frequent sets”
            - Based on confidence
    - Naive Algorithm
        - Exponential time
    - Apriori Algorithm
        - Polynomial time at each level (for example, quadratic for candidate pairs), although the worst case over all levels is still exponential
40. Observations
    - Actual values of support and confidence
        - 25%, 75% are very high values
        - In reality one works with far smaller values
    - “Interestingness” of a rule
        - X and Y are related events, not independent, hence P(X ∩ Y) ≠ P(X)P(Y)
        - Interestingness ≈ P(X ∩ Y) - P(X)P(Y)
    - Triviality of rules
        - Rules involving very frequent items can be trivial
        - You always buy potatoes when you go to the market, and so you can get rules that connect potatoes to many things
    - Inexplicable rules
        - Toothbrush was the most frequent item on Tuesday ??
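
The interestingness measure above can be worked through on the four-transaction example from slide 9. The sketch below is illustrative and the function names are hypothetical. For Juice => Cheese the value is 1/2 - (1/2)(3/4) = 1/8, which is positive, so the two items occur together more often than independence would predict. For Bread => Cheese it is 1/2 - (3/4)(3/4) = -1/16, slightly negative, even though both items are individually very frequent.

```python
from fractions import Fraction

# The four transactions from slide 9.
TRANSACTIONS = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]


def probability(itemset):
    """P(itemset): fraction of transactions containing all of its items."""
    itemset = set(itemset)
    hits = sum(1 for t in TRANSACTIONS if itemset <= t)
    return Fraction(hits, len(TRANSACTIONS))


def interestingness(x, y):
    """P(X and Y together) minus the value expected under independence."""
    return probability(set(x) | set(y)) - probability(x) * probability(y)


if __name__ == "__main__":
    print(interestingness({"Juice"}, {"Cheese"}))
    print(interestingness({"Bread"}, {"Cheese"}))
```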
41. Better Algorithms
    - Enhancements to the Apriori Algorithm
        - Apriori-TID
        - Direct Hashing and Pruning (DHP)
        - Dynamic Itemset Counting (DIC)
    - Frequent Pattern (FP) Tree
        - Only frequent items are needed to find association rules, so ignore the others!
        - Move the data of only the frequent items to a more compact and efficient structure
        - A tree structure or a directed graph is used
        - Multiple transactions with the same (frequent) items are stored once, with a count
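
The compact structure described above can be sketched as a prefix tree with counts. This is a simplified illustration under stated assumptions, not a full FP-growth implementation: it only builds the tree, it omits the header table and node links that the mining phase normally uses, and the names `Node` and `build_fp_tree` are hypothetical. Items within a transaction are ordered by descending frequency (ties broken alphabetically) so that transactions sharing frequent items share a path. The data is the four-transaction example from slide 9 with a minimum count of 2.

```python
from collections import Counter

# The four transactions from slide 9.
TRANSACTIONS = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]


class Node:
    """One tree node: an item, how many transactions pass through it, children."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {item for item, c in counts.items() if c >= min_count}

    # Second scan: insert each transaction's frequent items, ordered by
    # descending frequency, so that shared prefixes are stored once.
    root = Node()
    for t in transactions:
        ordered = sorted(
            (item for item in t if item in frequent),
            key=lambda item: (-counts[item], item),
        )
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def show(node, depth=0):
    """Print the tree with indentation, one node per line."""
    for child in sorted(node.children.values(), key=lambda n: n.item):
        print("  " * depth + f"{child.item} ({child.count})")
        show(child, depth + 1)


if __name__ == "__main__":
    show(build_fp_tree(TRANSACTIONS, min_count=2))
```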
42. Software Support
    - KDNuggets.com
        - Excellent collections of software available
    - Bart Goethals
        - Free software for Apriori, FP-Tree
    - ARMiner
        - GNU Open Source software from UMass/Boston
    - DMII
        - National University of Singapore
    - DB2 Intelligent Data Miner
        - IBM Corporation
        - Equivalent software available from other vendors as well