Transcript

  • 1. Data Mining: Association Rule Mining, or Market Basket Analysis. Prithwis Mukerjee, Ph.D.
  • 2. Let us describe the problem ...
    • A retailer sells a number of items (listed in a table on the slide)
      • We assume that the shopkeeper keeps track of what each customer purchases
      • He needs to know which items are generally sold together
  • 3. Associations
    • Rules expressing relations between items in a “Market Basket”
    • {Sugar, Tea} ⇒ {Biscuits}
      • Is it true that if a customer buys Sugar and Tea, she will also buy Biscuits?
      • If so, then
        • These items should be ordered together
        • But discounts should not be given on these items at the same time!
    • We can make a guess, but
      • It would be better if we could structure this problem in terms of mathematics
  • 4. Basic Concepts
    • Set of n items on sale
      • I = {i_1, i_2, i_3, ..., i_n}
    • Transaction
      • A subset of I: T ⊆ I
      • The set of items purchased in an individual transaction
      • A transaction with m items is t_i = {i_1, i_2, ..., i_m}, with m ≤ n
      • If we have N transactions, then t_1, t_2, t_3, ..., t_N are their unique identifiers
    • D is our total data about all N transactions (a sketch of this representation follows below)
      • D = {t_1, t_2, t_3, ..., t_N}
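
To make the notation concrete, here is a minimal sketch in Python of how I, the transactions t_i, and D might be represented. The item names and transactions are hypothetical, chosen only for illustration:

```python
# I: the set of n items on sale
I = {"Bread", "Cheese", "Juice", "Milk", "Sugar", "Tea", "Biscuits"}

# Each transaction t_i is a subset of I; D is the collection of all N of them.
D = [
    {"Bread", "Cheese", "Juice"},   # t_1
    {"Sugar", "Tea", "Biscuits"},   # t_2
    {"Bread", "Milk"},              # t_3
]

assert all(t <= I for t in D)   # every transaction is a subset of I
N = len(D)                      # number of transactions
```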
  • 5. An Association Rule
    • Whenever X appears, Y also appears
      • X ⇒ Y
      • X ⊆ I, Y ⊆ I, X ∩ Y = ∅
    • X and Y may be
      • Single items or
      • Sets of items, with no item appearing in both X and Y
    • X is referred to as the antecedent
    • Y is referred to as the consequent
    • Whether such a rule exists is the focus of our analysis
  • 6. Two key concepts
    • Support (or prevalence)
      • How often do X and Y appear together in the basket?
      • If this number is very low, the rule is not worth examining
      • Expressed as a fraction of the total number of transactions
      • Say 10%, or 0.1
    • Confidence (or predictability)
      • Of all the occurrences of X, in what fraction does Y also appear?
      • Expressed as a fraction of all transactions containing X
      • Say 80%, or 0.8
    • We are interested in rules that have a
      • Minimum value of support: say 25%
      • Minimum value of confidence: say 75%
  • 7. Mathematically speaking ...
    • Support(X)
      • = (number of transactions in which X appears) / N
      • = P(X)
    • Support(XY)
      • = (number of transactions in which both X and Y appear) / N
      • = P(X ∪ Y)
    • Confidence(X ⇒ Y)
      • = Support(XY) / Support(X)
      • = P(X ∪ Y) / P(X)
      • = the conditional probability P(Y | X)
    • Lift: an optional term
      • Measures the power of the association
      • = P(Y | X) / P(Y)
      • All three measures are sketched in code below
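
The three measures translate directly into code. A minimal sketch, reusing the list-of-sets representation of D from the earlier snippet (the function names are our own, not from any particular library):

```python
def support(itemset, D):
    """Fraction of transactions containing every item in `itemset` (= P(X))."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(X, Y, D):
    """Support of X and Y together, divided by support of X (= P(Y | X))."""
    return support(X | Y, D) / support(X, D)

def lift(X, Y, D):
    """P(Y | X) / P(Y): how much X boosts the chance of Y."""
    return confidence(X, Y, D) / support(Y, D)

# With the hypothetical D from before, confidence({"Sugar", "Tea"},
# {"Biscuits"}, D) asks: of the baskets containing Sugar and Tea,
# what fraction also contain Biscuits?
```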
  • 8. The task at hand ...
    • Given a large set of transactions, we seek a procedure (or algorithm)
      • That will discover all association rules
      • That have a minimum support of p%
      • And a minimum confidence level of q%
      • And does so in an efficient manner
    • Algorithms
      • The Naive or Brute Force method
        • The Improved Naive algorithm
      • The Apriori algorithm
        • Improvements to the Apriori algorithm
      • The FP (Frequent Pattern) algorithm
  • 9. Let us try the Naive Algorithm manually!
    • This is the set of transactions that we have (shown in a table on the slide)
      • We want to find association rules with
        • Minimum 50% support and
        • Minimum 75% confidence
  • 10. Itemsets & Frequencies
    • Which sets are frequent?
      • Since we are looking for 50% support, a set must appear in at least 2 of the 4 transactions
        • Support(X) = (number of times X appears) / N = P(X)
      • Six sets meet this criterion (see the brute-force sketch below)
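
A brute-force sketch of the naive method in Python. The transaction table itself lives in a slide image, so the four transactions below are hypothetical stand-ins, chosen to be consistent with the results quoted on these slides (six frequent sets, including the pairs named on the next slide):

```python
from itertools import combinations

# Naive (brute force) method: enumerate every possible itemset, count its
# frequency, and keep those meeting minimum support.
D = [
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Cheese"},
    {"Cheese", "Juice", "Milk"},
    {"Bread", "Milk"},
]
items = sorted(set().union(*D))
N = len(D)
min_support = 0.5          # a set must appear in at least 2 of 4 transactions

frequent = {}
for k in range(1, len(items) + 1):        # all itemset sizes: 2^n - 1 itemsets
    for combo in combinations(items, k):
        count = sum(1 for t in D if set(combo) <= t)
        if count / N >= min_support:
            frequent[combo] = count

print(frequent)
# {('Bread',): 3, ('Cheese',): 3, ('Juice',): 2, ('Milk',): 2,
#  ('Bread', 'Cheese'): 2, ('Cheese', 'Juice'): 2}
```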
  • 11. A closer look at the “Frequent Set”
    • Look at itemsets with more than one item
      • {Bread, Cheese}, {Cheese, Juice}
      • 4 rules are possible
    • Look at the confidence levels
      • Confidence(X ⇒ Y)
      • = Support(XY) / Support(X)
  • 13. The Big Picture
    • List all itemsets
      • Find frequency of each
    • Identify “frequent sets”
      • Based on support
    • Search for Rules within “frequent sets”
      • Based on confidence
  • 14. Looking Beyond the Retail Store
    • Counter-terrorism
      • Track phone calls made or received by a particular number every day
      • Is an incoming call from a particular number followed by a call to another number?
      • Are there any sets of numbers that are always called together?
    • Expand the itemsets to include
      • Electronic fund transfers
      • Travel between two locations
        • Boarding cards
        • Railway reservations
    • All of this data is available in electronic format
  • 15. Major Problem
    • Exponential growth of the number of itemsets
      • 4 items: 2^4 = 16 itemsets
      • n items: 2^n itemsets
      • As n becomes larger, the problem can no longer be solved in any reasonable time
    • All attempts are made to reduce the number of itemsets to be processed
    • The “Improved” Naive algorithm
      • Ignore sets with zero frequency
  • 16. The Apriori Algorithm
    • Consists of two parts
      • First, find the frequent itemsets
        • Most of the cleverness happens here
        • We will do better than the naive algorithm
      • Then, find the rules
        • This is relatively simpler
  • 17. Apriori: Part 1 - Frequent Sets
    • Step 1
      • Scan all transactions and find all frequent items that have support above p%. This is set L_1
    • Step 2: Apriori-Gen
      • Build candidate sets of k items from L_{k-1}, by joining pairs of itemsets in L_{k-1} that have their first k-2 items in common and taking the one remaining item from each member of the pair
      • This is the candidate set C_k (sketched in code below)
    • Step 3: Find frequent itemsets again
      • Scan all transactions and find the frequency of the sets in C_k; those that are frequent form L_k
      • If L_k is empty, stop; else go back to Step 2
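
A sketch of the Apriori-Gen join just described, with itemsets kept as sorted tuples so that "the first k-2 items" is well defined. The subset-pruning test at the end is the standard companion to the join step, included here for completeness:

```python
def apriori_gen(L_prev):
    """Join L_{k-1} with itself to form the candidate set C_k."""
    L_prev = sorted(L_prev)
    prev_set = set(L_prev)
    C_k = []
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            a, b = L_prev[i], L_prev[j]
            if a[:-1] == b[:-1]:            # first k-2 items in common
                cand = a + (b[-1],)         # one remaining item from each
                # Prune: every (k-1)-subset of a candidate must itself be
                # frequent, otherwise the candidate cannot be frequent.
                if all(cand[:m] + cand[m + 1:] in prev_set
                       for m in range(len(cand))):
                    C_k.append(cand)
    return C_k

# Example: L_2 = [("A","B"), ("A","C"), ("B","C")]
# apriori_gen(L_2) yields C_3 = [("A", "B", "C")]
```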
  • 19. Example
    • We have 16 items spread over 25 transactions
  • 20. Apriori: Step 1 - Computing L_1
    • Count the frequency of each item and exclude those below the 25% minimum support. The items that survive form set L_1, as sketched below.
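
Step 1 as code, a minimal sketch: count each item across all transactions and keep those meeting the threshold (here 25% of the 25 transactions in the example):

```python
from collections import Counter

def compute_L1(D, min_support=0.25):
    """Scan all transactions once and return the frequent 1-itemsets L_1."""
    counts = Counter(item for t in D for item in t)
    threshold = min_support * len(D)   # 25% of 25 transactions = 6.25,
                                       # so an item needs a count of 7 or more
    return sorted(item for item, c in counts.items() if c >= threshold)
```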
  • 22. Step 2: Computing C_2
      • Given L_1, we now form the candidate pairs of C_2. The 7 items in L_1 form 21 pairs: d(d-1)/2, which is a quadratic function, not an exponential one. The move from L_1 to C_2 is sketched below.
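
Forming C_2 is just enumerating unordered pairs; a quick check that d = 7 frequent items give d(d-1)/2 = 21 candidates (the item names here are placeholders):

```python
from itertools import combinations

L1 = ["A", "B", "C", "D", "E", "F", "G"]   # placeholders for the 7 frequent items
C2 = list(combinations(L1, 2))             # every unordered pair of items in L_1
assert len(C2) == len(L1) * (len(L1) - 1) // 2 == 21
```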
  • 24. From C_2 to L_2, based on minimum support
      • This is a computationally intensive step
      • At the 25% minimum support, L_2 is not empty; the surviving pairs form set L_2
  • 26. Step 2 again: Get C_3
      • We combine the appropriate frequent 2-itemsets from L_2 (those with the same first item) and obtain four candidate itemsets, each containing three items. This takes us from L_2 to C_3.
  • 27. Step 3 again: C_3 to L_3
    • Again based on the 25% minimum support
      • Since C_4 cannot be formed, L_4 cannot be formed either, so we stop here
  • 30. Apriori: Part 2 - Find Rules
    • Rules will be found by looking at
      • 3-itemsets found in L_3
      • 2-itemsets in L_2 that are not subsets of any itemset in L_3
    • In each case we
      • Calculate Confidence(A ⇒ B)
        • = P(B | A) = P(A ∪ B) / P(A)
    • Some shorthand
      • {Bread, Cheese, Coffee} is written as {B, C, D}
    • A rule-generation sketch follows below
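
A sketch of Part 2 for a single frequent itemset: try every non-empty proper subset as the antecedent and keep the rules whose confidence clears the threshold. The support() function from earlier is restated so the snippet runs on its own; the 70% cutoff matches the one used on the next slides:

```python
from itertools import combinations

def support(itemset, D):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

def rules_from_itemset(itemset, D, min_conf=0.70):
    """All rules X => Y with X union Y = itemset and confidence >= min_conf."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):                    # antecedent sizes 1..k-1
        for antecedent in combinations(sorted(items), r):
            X = frozenset(antecedent)
            Y = items - X                             # consequent: the rest
            conf = support(items, D) / support(X, D)  # = P(Y | X)
            if conf >= min_conf:
                rules.append((set(X), set(Y), conf))
    return rules

# A 3-itemset such as {B, C, D} yields the 6 candidate rules listed on the
# next slide; only those with confidence >= 70% survive.
```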
  • 31. Rules for Finding Rules!
    • A 3-item frequent set {B, C, D} results in 6 rules
      • B ⇒ CD, C ⇒ BD, D ⇒ BC
      • CD ⇒ B, BD ⇒ C, BC ⇒ D
    • Also note that
      • B ⇒ CD also implies
        • B ⇒ C and B ⇒ D
    • We now look at the two 3-itemsets in L_3 (the highest L set) and find the confidence levels of their rules
      • {Bread, Cheese, Coffee}
      • {Chocolate, Donuts, Juice}
      • Note that the support counts for these itemsets are 8 and 7
  • 32. Rules from the First of the 2 Itemsets in L_3
    • One rule drops out because its confidence is below 70%
      • Calculate Confidence(X ⇒ Y)
        • = P(Y | X) = P(X ∪ Y) / P(X)
  • 34. Rules from the Second of the 2 Itemsets in L_3
    • One rule drops out because its confidence is below 70%
  • 36. Set of 14 Rules obtained from L_3
  • 37. What about L_2?
    • Look for sets in L_2 that are not subsets of any set in L_3
      • {Bread, Cereal} is the only candidate
      • This gives us two more rules
        • Bread ⇒ Cereal
        • Cereal ⇒ Bread
  • 38. These are now added to give 16 rules in all
  • 39. So where are we?
    • The Apriori algorithm consists of two parts
      • First find the frequent itemsets
        • Most of the cleverness happens here
        • We did better than the naive algorithm
      • Then find the rules
        • This is relatively simpler
    • We have just completed the two parts
    • The overall approach to ARM (Association Rule Mining) is as follows
      • List all itemsets
        • Find the frequency of each
      • Identify “frequent sets”
        • Based on support
      • Search for rules within the “frequent sets”
        • Based on confidence
    • Naive algorithm
      • Exponential time
    • Apriori algorithm
      • Polynomial time
  • 40. Observations
    • Actual values of support and confidence
      • 25% and 75% are very high values
      • In reality one works with far smaller values
    • “Interestingness” of a rule
      • Since X and Y are related events, not independent ones, P(X ∪ Y) ≠ P(X)P(Y)
      • Interestingness = P(X ∪ Y) − P(X)P(Y)
    • Triviality of rules
      • Rules involving very frequent items can be trivial
      • You always buy potatoes when you go to the market, so you can get rules that connect potatoes to many other things
    • Inexplicable rules
      • The toothbrush was the most frequent item on Tuesday??
  • 41. Better Algorithms
    • Enhancements to the Apriori algorithm
      • AP-TID
      • Direct Hashing and Pruning (DHP)
      • Dynamic Itemset Counting (DIC)
    • Frequent Pattern (FP) Tree
      • Only frequent items are needed to find association rules, so ignore the others!
      • Move the data for the frequent items into a more compact and efficient structure
        • A tree structure or a directed graph is used
      • Multiple transactions with the same (frequent) items are stored once, with count information (see the sketch below)
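
To make the compact FP structure concrete, here is a minimal FP-tree insertion sketch. It assumes each transaction has already been reduced to its frequent items and sorted in descending frequency order; the class and names are illustrative, not from any specific library:

```python
class FPNode:
    """One node of an FP-tree: an item, a count, and child nodes."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}   # item -> FPNode

    def insert(self, transaction):
        """Walk/extend the path for this transaction, bumping counts."""
        if not transaction:
            return
        head, *rest = transaction
        child = self.children.setdefault(head, FPNode(head))
        child.count += 1     # shared prefix: stored once, counted many times
        child.insert(rest)

root = FPNode(None)
root.insert(["Bread", "Cheese", "Juice"])
root.insert(["Bread", "Cheese"])   # reuses the existing Bread -> Cheese path
# Result: Bread(2) -> Cheese(2) -> Juice(1), one path for two transactions.
```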
  • 42. Software Support
    • KDNuggets.com
      • An excellent collection of software is available here
    • Bart Goethals
      • Free software for Apriori and FP-Tree
    • ARMiner
      • GNU open-source software from UMass/Boston
    • DMII
      • National University of Singapore
    • DB2 Intelligent Data Miner
      • IBM Corporation
      • Equivalent software is available from other vendors as well