
Data mining arm-2009-v0

Transcript

• 1. Data Mining Association Rules Mining or Market Basket Analysis Prithwis Mukerjee, Ph.D.
• 2. Let us describe the problem ...
• A retailer sells the following items
• And we assume that the shopkeeper keeps track of what each customer purchases :
• He needs to know which items are generally sold together
• 3. Associations
• Rules expressing relations between items in a “Market Basket”
• { Sugar, Tea } ⇒ { Biscuits }
• Is it true that if a customer buys Sugar and Tea, she will also buy Biscuits?
• If so, then
• These items should be ordered together
• But discounts should not be given on these items at the same time !
• We can make a guess but
• It would be better if we could structure this problem in terms of mathematics
• 4. Basic Concepts
• Set of n Items on Sale
• I = { i 1 , i 2 , i 3 , ... , i n }
• Transaction
• A subset of I : T ⊆ I
• A set of items purchased in an individual transaction
• With each transaction having m items
• t i = { i 1 , i 2 , i 3 , ... , i m } with m < n
• If we have N transactions then t 1 , t 2 , t 3 , ... , t N serve as unique identifiers for each transaction
• D is our total data about all N transactions
• D = { t 1 , t 2 , t 3 , ... , t N }
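The notation above can be sketched directly in code; the items and transactions below are hypothetical stand-ins, not the retailer's actual data:

```python
# I: the set of n items on sale (hypothetical example items)
I = {"Bread", "Cereal", "Cheese", "Coffee", "Sugar", "Tea", "Biscuits"}

# D = { t1, t2, ..., tN }: each transaction ti is a subset of I
D = [
    {"Sugar", "Tea", "Biscuits"},
    {"Bread", "Coffee"},
    {"Sugar", "Tea"},
    {"Bread", "Cheese", "Coffee"},
]

N = len(D)
# Every transaction T satisfies T ⊆ I
assert all(t <= I for t in D)
print(N)  # 4
```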
• 5. An Association Rule
• Whenever X appears, Y also appears
• X ⇒ Y
• X ⊂ I, Y ⊂ I, X ∩ Y = ∅
• X and Y may be
• Single items or
• Sets of items – in which case no item appears in both X and Y
• X is referred to as the antecedent
• Y is referred to as the consequent
• Whether a rule like this exists is the focus of our analysis
• 6. Two key concepts
• Support ( or prevalence)
• How often do X and Y appear together in the basket?
• If this number is very low then it is not worth examining
• Expressed as a fraction of the total number of transactions
• Say 10% or 0.1
• Confidence ( or predictability )
• Of all the occurrences of X, in what fraction does Y also appear?
• Expressed as a fraction of all transactions containing X
• Say 80% or 0.8
• We are interested in rules that have a
• Minimum value of support : say 25%
• Minimum value of confidence : say 75%
• 7. Mathematically speaking ...
• Support (X)
• = (Number of times X appears ) / N
• = P(X)
• Support (XY)
• = (Number of times X and Y appear together) / N
• = P(X ∪ Y)
• Confidence (X ⇒ Y)
• = Support (XY) / Support (X)
• = P(X ∪ Y) / P(X)
• = Conditional Probability P(Y | X)
• Lift : an optional term
• Measures the power of association
• P( Y | X) / P(Y)
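These formulas translate directly into code. A minimal sketch, using a small hypothetical transaction list rather than the slide's data:

```python
# Hypothetical transaction database, N = 4
D = [
    {"Sugar", "Tea", "Biscuits"},
    {"Sugar", "Tea"},
    {"Bread", "Coffee"},
    {"Sugar", "Tea", "Biscuits", "Bread"},
]
N = len(D)

def support(itemset):
    # Support(X) = (number of transactions containing X) / N = P(X)
    return sum(1 for t in D if itemset <= t) / N

def confidence(X, Y):
    # Confidence(X => Y) = Support(XY) / Support(X) = P(Y | X)
    return support(X | Y) / support(X)

def lift(X, Y):
    # Lift = P(Y | X) / P(Y); values above 1 suggest a real association
    return confidence(X, Y) / support(Y)

print(support({"Sugar", "Tea"}))                   # 0.75
print(confidence({"Sugar", "Tea"}, {"Biscuits"}))  # 2/3 ≈ 0.667
```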
• 8. The task at hand ...
• Given a large set of transactions, we seek a procedure ( or algorithm )
• That will discover all association rules
• That have a minimum support of p%
• And a minimum confidence level of q%
• And to do so in an efficient manner
• Algorithms
• The Naive or Brute Force Method
• The Improved Naive algorithm
• The Apriori Algorithm
• Improvements to the Apriori algorithm
• FP ( Frequent Pattern ) Algorithm
• 9. Let us try the Naive Algorithm manually !
• This is the set of transactions that we have ...
• We want to find Association Rules with
• Minimum 50% support and
• Minimum 75% confidence
• 10. Itemsets & Frequencies
• Which sets are frequent ?
• Since we are looking for a support of 50% , we need a set to appear in 2 out of 4 transactions
• = (# of times X appears ) / N
• = P(X)
• 6 sets meet this criterion
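The naive method amounts to enumerating every non-empty itemset and counting its support. A sketch with hypothetical transactions (the slide's own table is not reproduced here, so the frequent sets found differ):

```python
from itertools import combinations

# Hypothetical 4-transaction database; 50% support means an itemset
# must appear in at least 2 of the 4 transactions
D = [
    {"Bread", "Cheese", "Coffee"},
    {"Bread", "Cheese"},
    {"Cheese", "Coffee"},
    {"Bread", "Milk"},
]
N = len(D)
min_support = 0.5

# Naive algorithm: enumerate all 2^n - 1 non-empty itemsets
items = sorted(set().union(*D))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        count = sum(1 for t in D if s <= t)
        if count / N >= min_support:
            frequent[s] = count / N

for s in sorted(frequent, key=lambda x: (len(x), sorted(x))):
    print(sorted(s), frequent[s])
```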
• 11. A closer look at the “Frequent Set”
• Look at itemsets with more than 1 item
• 4 rules are possible
• Look for confidence levels
• Confidence (X  Y )
• = Support (XY) / Support(X)
• 12. A closer look at the “Frequent Set”
• Look at itemsets with more than 1 item
• 4 rules are possible
• Look for confidence levels
• Confidence (X  Y )
• = Support (XY) / Support(X)
• 13. The Big Picture
• List all itemsets
• Find frequency of each
• Identify “frequent sets”
• Based on support
• Search for Rules within “frequent sets”
• Based on confidence
• 14. Looking Beyond the Retail Store
• Counter Terrorism
• Track phone calls made or received from a particular number every day
• Is an incoming call from a particular number followed by a call to another number ?
• Are there any sets of numbers that are always called together ?
• Expand the item sets to include
• Electronic fund transfers
• Travel between two locations
• Boarding cards
• Railway reservation
All data is available in electronic format
• 15. Major Problem
• Exponential Growth of number of Itemsets
• 4 items : 2^4 = 16 itemsets
• n items : 2^n itemsets
• As n becomes larger, the problem can no longer be solved in a reasonable time
• All attempts are made to reduce the number of Item sets to be processed
• “Improved” Naive algorithm
• Ignore sets with zero frequency
• 16. The APriori Algorithm
• Consists of two PARTS
• First find the frequent itemsets
• Most of the cleverness happens here
• We will do better than the naive algorithm
• Find the rules
• This is relatively simpler
• 17. APriori : Part 1 - Frequent Sets
• Step 1
• Scan all transactions and find all frequent items that have support above p%. This is set L 1
• Step 2 : Apriori-Gen
• Build potential sets of k items from L k-1 by using pairs of itemsets in L k-1 that have the first k-2 items in common, plus the one remaining item from each member of the pair.
• This is the candidate set C k
• Step 3 : Find Frequent Item Sets again
• Scan all transactions and find the frequency of the sets in C k ; those that are frequent form L k
• If L k is empty, stop ; else go back to Step 2
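Steps 1–3 above can be sketched as follows. Working on sorted tuples makes the "first k-2 items in common" join simple to express; the transactions are hypothetical, and the subset-pruning refinement of apriori-gen is omitted for brevity:

```python
def apriori_frequent_sets(D, min_support):
    """Part 1 of Apriori: return all frequent itemsets as sorted tuples."""
    N = len(D)

    def support_count(itemset):
        return sum(1 for t in D if itemset <= t)

    # Step 1: L1 = frequent single items (support above the threshold)
    items = sorted(set().union(*D))
    Lk = [(i,) for i in items if support_count({i}) / N >= min_support]
    all_frequent = list(Lk)

    k = 2
    while Lk:
        # Step 2 (apriori-gen): join two (k-1)-sets that share their
        # first k-2 items; the joined candidate has k items
        Ck = [a[: k - 2] + tuple(sorted((a[-1], b[-1])))
              for i, a in enumerate(Lk) for b in Lk[i + 1:]
              if a[: k - 2] == b[: k - 2]]
        # Step 3: keep the candidates that meet minimum support -> Lk
        Lk = [c for c in Ck if support_count(set(c)) / N >= min_support]
        all_frequent.extend(Lk)
        k += 1
    return all_frequent

# Hypothetical transactions, 50% minimum support
D = [{"Bread", "Cheese", "Coffee"}, {"Bread", "Cheese"},
     {"Cheese", "Coffee"}, {"Bread", "Milk"}]
print(apriori_frequent_sets(D, 0.5))
```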
• 18. APriori : Part 1 - Frequent Sets
• Step 1
• Scan all transactions and find all frequent items that have support above p%. This is set L 1
• 19. Example
• We have 16 items spread over 25 transactions
• 20. Apriori : Step 1 – Computing L 1
• Count frequency for each item and exclude those that are below minimum support
( With 25% minimum support, this is set L 1 )
• 21. APriori : Part 1 - Frequent Sets
• Step 1
• Scan all transactions and find all frequent items that have support above p%. This is set L 1
• Step 2 : Apriori-Gen
• Build potential sets of k items from L k-1 by using pairs of itemsets in L k-1 that have the first k-2 items in common, plus the one remaining item from each member of the pair.
• This is the candidate set C k
• 22. Step 2 : Computing C 2
• Given L 1 , we now form the candidate pairs of C 2 . The 7 items in L 1 form 21 pairs : d(d-1)/2 – a quadratic function, not an exponential one.
( L 1 to C 2 )
• 23. APriori : Part 1 - Frequent Sets
• Step 1
• Scan all transactions and find all frequent items that have support above p%. This is set L 1
• Step 2 : Apriori-Gen
• Build potential sets of k items from L k-1 by using pairs of itemsets in L k-1 that have the first k-2 items in common, plus the one remaining item from each member of the pair.
• This is the candidate set C k
• Step 3 : Find Frequent Item Sets again
• Scan all transactions and find the frequency of the sets in C k ; those that are frequent form L k
• If L k is empty, stop ; else go back to Step 2
• 24. From C 2 to L 2 based on minimum support
• This is a computationally intensive step
• L 2 is not empty
( With 25% minimum support, this is set L 2 )
• 25. APriori : Part 1 - Frequent Sets
• Step 1
• Scan all transactions and find all frequent items that have support above p%. This is set L 1
• Step 2 : Apriori-Gen
• Build potential sets of k items from L k-1 by using pairs of itemsets in L k-1 that have the first k-2 items in common, plus the one remaining item from each member of the pair.
• This is the candidate set C k
• Step 3 : Find Frequent Item Sets again
• Scan all transactions and find the frequency of the sets in C k ; those that are frequent form L k
• If L k is empty, stop ; else go back to Step 2
• 26. Step 2 Again : Get C 3
• We combine the appropriate frequent 2-item sets from L 2 (which must have the same first item) and obtain four such itemsets, each containing three items
( L 2 to C 3 )
• 27. Step 3 Again C 3 to L 3
• Again based on minimum support
• Since C 4 cannot be formed, L 4 cannot be formed, so we stop here
( 25% minimum support )
• 28. APriori : Part 1 - Frequent Sets
• Step 1
• Scan all transactions and find all frequent items that have support above p%. This is set L 1
• Step 2 : Apriori-Gen
• Build potential sets of k items from L k-1 by using pairs of itemsets in L k-1 that have the first k-2 items in common, plus the one remaining item from each member of the pair.
• This is the candidate set C k
• Step 3 : Find Frequent Item Sets again
• Scan all transactions and find the frequency of the sets in C k ; those that are frequent form L k
• If L k is empty, stop ; else go back to Step 2
• 29. The APriori Algorithm
• Consists of two PARTS
• First find the frequent itemsets
• Most of the cleverness happens here
• We will do better than the naive algorithm
• Find the rules
• This is relatively simpler
• 30. APriori : Part 2 – Find Rules
• Rules will be found by looking at
• 3-item sets found in L3
• 2-item sets in L2 that are not subsets of any set in L3
• In each case we
• Calculate confidence (A ⇒ B)
• = P(B | A) = P(A ∪ B) / P(A)
• Some short hand
• {Bread, Cheese, Coffee } is written as { B, C, D}
• 31. Rules for Finding Rules !
• A 3-item frequent set { B, C, D } results in 6 rules
• B ⇒ CD, C ⇒ BD, D ⇒ BC
• CD ⇒ B, BD ⇒ C, BC ⇒ D
• Also note that
• B ⇒ CD implies both
• B ⇒ C and B ⇒ D
• We now look at these two 3-item sets and find their confidence levels
• { Chocolate, Donuts, Juice }
• These come from the L 3 set ( the highest L set ) ; their support counts are 8 and 7
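Rule generation from one frequent set can be sketched as below. The support counts are hypothetical placeholders for the slide's table, and the 70% confidence threshold follows the next slides:

```python
from itertools import combinations

def rules_from_frequent_set(F, support_count, min_conf):
    """Form every rule X => (F - X) from frequent set F and keep those
    with confidence = count(F) / count(X) at or above min_conf."""
    F = frozenset(F)
    rules = []
    for r in range(1, len(F)):
        for X in map(frozenset, combinations(sorted(F), r)):
            conf = support_count[F] / support_count[X]
            if conf >= min_conf:
                rules.append((sorted(X), sorted(F - X), conf))
    return rules

# Hypothetical support counts over 25 transactions (not the slide's table)
counts = {
    frozenset({"Chocolate"}): 12,
    frozenset({"Donuts"}): 10,
    frozenset({"Juice"}): 9,
    frozenset({"Chocolate", "Donuts"}): 9,
    frozenset({"Chocolate", "Juice"}): 8,
    frozenset({"Donuts", "Juice"}): 8,
    frozenset({"Chocolate", "Donuts", "Juice"}): 8,
}
for lhs, rhs, conf in rules_from_frequent_set(
        {"Chocolate", "Donuts", "Juice"}, counts, 0.70):
    print(lhs, "=>", rhs, round(conf, 2))
```

With these made-up counts, one of the six rules ({Chocolate} ⇒ {Donuts, Juice}, confidence 8/12 ≈ 0.67) falls below the 70% threshold and drops out.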
• 32. Rules from First of 2 Itemsets in L 3
• One rule drops out because confidence < 70%
• Calculate confidence (X ⇒ Y)
• = P(Y | X) = P(X ∪ Y) / P(X)
• 33. Rules from First of 2 Itemsets in L 3
• One rule drops out because confidence < 70%
• 34. Rules from Second of 2 Itemsets in L 3
• One rule drops out because confidence < 70%
• 35. Rules from Second of 2 Itemsets in L 3
• One rule drops out because confidence < 70%
• 36. Set of 14 Rules obtained from L 3
• 37. What about L 2 ?
• Look for sets in L 2 that are not subsets of L 3
• { Bread, Cereal} is the only candidate
• Which gives us two more rules
• 38. Which are now added to get 16 rules
• 39. So where are we ?
• Apriori Algorithm Consists of two PARTS
• First find the frequent itemsets
• Most of the cleverness happens here
• We will do better than the naive algorithm
• Find the rules
• This is relatively simpler
• We have just completed the two PARTS
• Overall approach to ARM is as follows
• List all itemsets
• Find frequency of each
• Identify “frequent sets”
• Based on support
• Search for Rules within “frequent sets”
• Based on confidence
• Naive Algorithm
• Exponential Time
• Apriori Algorithm
• Polynomial Time
• 40. Observations
• Actual values of support and confidence
• 25%, 75% are very high values
• In reality one works with far smaller values
• “ Interestingness” of a rule
• Since X, Y are related events – not independent – hence P(X ∪ Y) ≠ P(X)P(Y)
• Interestingness = P(X ∪ Y) – P(X)P(Y)
• Triviality of rules
• Rules involving very frequent items can be trivial
• You always buy potatoes when you go to the market and so you can get rules that connect potatoes to many things
• Inexplicable rules
• Toothbrush was the most frequent item on Tuesday ??
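The interestingness measure above (sometimes called leverage) is a one-liner; the probabilities here are made up for illustration:

```python
def interestingness(p_xy, p_x, p_y):
    # Leverage: P(X u Y) - P(X)P(Y); zero if X and Y were independent
    return p_xy - p_x * p_y

# Hypothetical values: X and Y co-occur in 30% of baskets,
# and appear individually in 50% and 40% of baskets
print(interestingness(0.30, 0.50, 0.40))  # 0.30 - 0.20 ≈ 0.10
```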
• 41. Better Algorithms
• Enhancements to the Apriori Algorithm
• AP-TID
• Direct Hashing and Pruning (DHP)
• Dynamic Itemset Counting (DIC)
• Frequent Pattern (FP) Tree
• Only frequent items are needed to find association rules – so ignore others !
• Move the data of only frequent items to a more compact and efficient structure
• A Tree structure or a directed graph is used
• Multiple transactions with same (frequent) items are stored once with a count information
• 42. Software Support
• KDNuggets.com
• Excellent collections of software available
• Bart Goethals
• Free software for Apriori, FP-Tree
• ARMiner
• GNU Open Source software from UMass/Boston
• DMII
• National University of Singapore
• DB2 Intelligent Data Miner
• IBM Corporation
• Equivalent software available from other vendors as well