Unsupervised Learning: Clustering



1. Unsupervised Learning
   - Clustering
     - Unsupervised classification, that is, classification without the class attribute
     - The goal is to discover the classes
   - Association Rule Discovery
     - Discover correlations

2. The Clustering Process
   - Pattern representation
   - Definition of a pattern proximity measure
   - Clustering
   - Data abstraction
   - Cluster validation

3. Pattern Representation
   - Number of classes
   - Number of available patterns
     - Circles, ellipses, squares, etc.
   - Feature selection
     - Can we use wrappers and filters?
   - Feature extraction
     - Produce new features
     - E.g., principal component analysis (PCA)

4. Pattern Proximity
   - Want clusters of instances that are similar to each other but dissimilar to instances in other clusters
   - Need a similarity measure
   - Continuous case
     - Euclidean measure (compact, isolated clusters)
     - The squared Mahalanobis distance alleviates problems with correlated attributes
     - Many more measures exist
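The two continuous-case measures above can be sketched in a few lines. This is an illustrative sketch assuming NumPy; it is not tied to any particular clustering package:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: tends to favour compact, isolated clusters."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def sq_mahalanobis(x, y, cov):
    """Squared Mahalanobis distance: rescaling by the covariance matrix
    alleviates problems with correlated attributes."""
    d = x - y
    return float(d @ np.linalg.inv(cov) @ d)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(x, y))                    # 5.0
print(sq_mahalanobis(x, y, np.eye(2)))    # 25.0: with an identity covariance it
                                          # reduces to the squared Euclidean distance
```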
5. Pattern Proximity
   - Nominal attributes

6. Clustering Techniques
   - Hierarchical
     - Single Link
     - Complete Link
   - Partitional
     - Square Error: K-means
     - Mixture Maximization: Expectation Maximization
     - CobWeb

7. Technique Characteristics
   - Agglomerative vs divisive
     - Agglomerative: each instance starts as its own cluster and the algorithm merges clusters
     - Divisive: begins with all instances in one cluster and divides it up
   - Hard vs fuzzy
     - Hard clustering assigns each instance to exactly one cluster, whereas fuzzy clustering assigns each instance a degree of membership in every cluster

8. More Characteristics
   - Monothetic vs polythetic
     - Polythetic: all attributes are used simultaneously, e.g., to calculate distances (most algorithms)
     - Monothetic: attributes are considered one at a time
   - Incremental vs non-incremental
     - With large data sets it may be necessary to consider only part of the data at a time (data mining)
     - Incremental algorithms work instance by instance

9. Hierarchical Clustering
   [Figure: a dendrogram over instances A-G; the vertical axis shows similarity]

10. Hierarchical Algorithms
   - Single-link
     - The distance between two clusters is the minimum of the distances between all pairs of instances drawn from the two clusters
     - More versatile
     - Produces (sometimes too) elongated clusters
   - Complete-link
     - The distance between two clusters is the maximum of the distances between all pairs of instances drawn from the two clusters
     - Tightly bound, compact clusters
     - Often more useful in practice
11. Example: Clusters Found
   [Figure: the same two point sets (labelled 1 and 2, plus noise points *) as clustered by single-link and by complete-link]

12. Partitional Clustering
   - Outputs a single partition of the data into clusters
   - Good for large data sets
   - Determining the number of clusters is a major challenge

13. K-Means
   [Figure: a predetermined number of clusters, started as seed clusters of one element each]

14. Assign Instances to Clusters
   [Figure: each instance is assigned to the nearest seed]

15. Find New Centroids
   [Figure: the centroid of each cluster becomes its new center]

16. New Clusters
   [Figure: instances are reassigned to the new centers]
17. Discussion: k-means
   - Applicable to fairly large data sets
   - Sensitive to the initial centers
     - Use other heuristics to find good initial centers
   - Converges to a local optimum
   - Specifying the number of centers is very subjective
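The assign/recompute loop of slides 13-17 can be sketched in plain Python. The toy points and the simple random seeding below are invented for the example:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means on 2-D points: seed k clusters with single elements,
    then alternate assignment and centroid update until convergence."""
    centers = random.Random(seed).sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each instance to its nearest center
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # recompute each centroid (keep the old center if a cluster is empty)
        new = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:          # converged -- to a local optimum only
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
```

Because the result depends on the initial centers, practical implementations restart from several seeds and keep the lowest squared-error solution.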
18. Clustering in Weka
   - Clustering algorithms in Weka
     - K-Means
     - Expectation Maximization (EM)
     - Cobweb
       - hierarchical, incremental, and agglomerative
19. CobWeb
   - Algorithm (main) characteristics:
     - Hierarchical and incremental
     - Uses category utility:
       CU(C_1, ..., C_k) = (1/k) Σ_l P(C_l) Σ_i Σ_j [ P(a_i = v_ij | C_l)² − P(a_i = v_ij)² ]
       where l runs over the k clusters and j over all possible values of attribute a_i
   - Why divide by k?

20. Category Utility
   - If each instance is in its own cluster, every conditional probability P(a_i = v_ij | C_l) is 1, so the numerator of the category utility is as large as it can possibly be
   - Without the division by k it would therefore always be best for each instance to have its own cluster: overfitting!
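The category utility formula, and the effect of the division by k, can be checked directly in code. The tiny two-attribute data set is invented for illustration:

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a clustering of instances with nominal attributes.
    clusters: list of clusters, each a list of equal-length value tuples."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    m = len(clusters[0][0])                     # number of attributes
    # P(a_i = v) over the whole data set
    overall = [Counter(inst[i] for c in clusters for inst in c) for i in range(m)]
    base = sum((cnt / n) ** 2 for i in range(m) for cnt in overall[i].values())
    cu = 0.0
    for c in clusters:
        within = [Counter(inst[i] for inst in c) for i in range(m)]
        inner = sum((cnt / len(c)) ** 2 for i in range(m) for cnt in within[i].values())
        cu += (len(c) / n) * (inner - base)
    return cu / k        # the division by k penalises many tiny clusters

two = [[('sunny', 'hot'), ('sunny', 'hot')], [('rainy', 'cool'), ('rainy', 'cool')]]
singletons = [[inst] for c in two for inst in c]
print(category_utility(two))         # 0.5
print(category_utility(singletons))  # 0.25: one cluster per instance scores lower
```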
21. The Weather Problem
   [Table: the 14-instance weather data]

22. Weather Data (without Play)
   - Label the instances a, b, ..., n
   - Start by putting the first instance, a, in its own cluster
   - Add the next instance, b, in its own cluster

23. Adding the Third Instance
   - Evaluate the category utility of adding instance c to one of the two existing clusters versus adding it as its own cluster
   - [Figure: the three candidate hierarchies; the one with the highest utility is kept]

24. Adding Instance f
   - The first instance not to get its own cluster
   - Look at the instances:
     e) Rainy, Cool, Normal, FALSE
     f) Rainy, Cool, Normal, TRUE
   - Quite similar!

25. Add Instance g
   - Look at the instances:
     e) Rainy, Cool, Normal, FALSE
     f) Rainy, Cool, Normal, TRUE
     g) Overcast, Cool, Normal, TRUE

26. Add Instance h
   - Look at the instances:
     a) Sunny, Hot, High, FALSE
     d) Rainy, Mild, High, FALSE
     h) Sunny, Mild, High, FALSE
   - Rearrange: the best matching node and the runner-up are merged into a single cluster before h is added (splitting is also possible)

27. Final Hierarchy
   [Figure: the final hierarchy over instances a-n] What next?

28. Dendrogram => Clusters
   - What do a, b, c, d, h, k, and l have in common?
29. Numerical Attributes
   - Assume a normal distribution
   - Problems with zero variance!
   - The acuity parameter imposes a minimum variance

30. Hierarchy Size (Scalability)
   - May create a very large hierarchy
     - The cutoff parameter is used to suppress growth
     - If the category utility gained by keeping a node is below the cutoff, the node is cut off

31. Discussion
   - Advantages
     - Incremental => scales to a large number of instances
     - Cutoff => limits the size of the hierarchy
     - Handles mixed attributes
   - Disadvantages
     - Incremental => sensitive to the order of the instances?
     - Arbitrary choices of parameters:
       - the division by k,
       - the artificial minimum value for the variance of numeric attributes,
       - the ad hoc cutoff value

32. Probabilistic Perspective
   - Find the most likely set of clusters given the data
   - Report the probability of each instance belonging to each cluster
   - Assumption: instances are drawn from one of several distributions
   - Goal: estimate the parameters of these distributions
   - Usually: assume the distributions are normal

33. Mixture Resolution
   - Mixture: a set of k probability distributions
   - These represent the k clusters
   - They give the probability that an instance takes certain attribute values given that it is in the cluster
   - What is the probability that an instance belongs to a cluster (or a distribution)?
34. One Numeric Attribute
   - Two-cluster mixture model over one attribute: cluster A and cluster B
   - Given some data, how can you determine the parameters (the mean and standard deviation of each cluster, and the prior probability of each cluster)?

35. Problems
   - If we knew which cluster each instance came from, we could estimate these values
   - If we knew the parameters, we could calculate the probability that an instance belongs to each cluster
36. EM Algorithm
   - Expectation Maximization (EM)
     - Start with initial values for the parameters
     - Calculate the cluster probabilities for each instance
     - Re-estimate the values of the parameters
     - Repeat
   - A general-purpose maximum likelihood estimation algorithm for missing data
     - Can also be used to train Bayesian networks (later)
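For the one-attribute, two-cluster mixture of slides 34-35 the EM loop is short. This sketch uses a deterministic min/max seeding and an invented toy sample; the variance floor plays the role of Weka's minStdDev:

```python
import math

def em_two_gaussians(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    mu = [min(data), max(data)]            # simple deterministic initial means
    sd = [1.0, 1.0]
    pi = [0.5, 0.5]
    def pdf(x, m, s):
        return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    for _ in range(iters):
        # E-step: probability that each instance belongs to each cluster
        resp = []
        for x in data:
            w = [pi[j] * pdf(x, mu[j], sd[j]) for j in range(2)]
            t = sum(w)
            resp.append([wj / t for wj in w])
        # M-step: re-estimate the parameters from the weighted instances
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj
            sd[j] = math.sqrt(max(var, 1e-6))   # variance floor, cf. minStdDev
    return mu, sd, pi

data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
mu, sd, pi = em_two_gaussians(data)
print(mu)   # means close to 0.05 and 5.05
```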
37. Beyond Normal Models
   - More than one class:
     - Straightforward
   - More than one numeric attribute
     - Easy if we assume the attributes are independent
     - If attributes are dependent, treat them jointly using the bivariate normal
   - Nominal attributes
     - No more normal distribution!

38. EM using Weka
   - Options
     - numClusters: sets the number of clusters
       - The default, -1, selects it automatically
     - maxIterations: maximum number of iterations
     - seed: random number seed
     - minStdDev: minimum allowable standard deviation

39. Other Clustering
   - Artificial Neural Networks (ANN)
   - Random search
     - Genetic Algorithms (GA)
       - GAs have been used to find initial centroids for k-means
     - Simulated Annealing (SA)
     - Tabu Search (TS)
   - Support Vector Machines (SVM)
   - We will discuss GA and SVM later
40. Applications
   - Image segmentation
   - Object and character recognition
   - Data mining:
     - Stand-alone, to gain insight into the data
     - As a preprocessing step before classification that operates on the detected clusters

41. DM Clustering Challenges
   - Data mining deals with large databases
   - Scalability with respect to the number of instances
     - Use a random sample (possible bias)
   - Dealing with mixed data
     - Many algorithms only make sense for numeric data
   - High-dimensional problems
     - Can the algorithm handle many attributes?
     - How do we interpret a cluster in high dimensions?

42. Other (General) Challenges
   - Shape of the clusters
   - Minimum domain knowledge (e.g., knowing the number of clusters)
   - Noisy data
   - Insensitivity to instance order
   - Interpretability and usability

43. Clustering for DM
   - The main issue is scalability to large databases
   - Many algorithms have been developed for scalable clustering:
     - Partitional methods: CLARA, CLARANS
     - Hierarchical methods: AGNES, DIANA, BIRCH, CURE, Chameleon
44. Practical Partitional Clustering Algorithms
   - Classic k-means (1967)
   - Work from 1990 and later:
   - k-medoids
     - Uses the medoid instead of the centroid
     - Less sensitive to outliers and noise
     - Computations are more costly
     - PAM (Partitioning Around Medoids) algorithm

45. Large-Scale Problems
   - CLARA: Clustering LARge Applications
     - Select several random samples of instances
     - Apply PAM to each
     - Return the best clusters
   - CLARANS:
     - Similar to CLARA
     - Draws samples randomly while searching
     - More effective than PAM and CLARA

46. Hierarchical Methods
   - BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
     - Clustering feature: a triplet summarizing information about subclusters
     - Clustering feature tree: a height-balanced tree that stores the clustering features

47. BIRCH Mechanism
   - Phase I:
     - Scan the database to build an initial CF tree
     - A multilevel compression of the data
   - Phase II:
     - Apply a selected clustering algorithm to the leaf nodes of the CF tree
   - Has been found to be very scalable

48. Conclusion
   - The use of clustering in data mining practice seems to be somewhat limited due to scalability problems
   - A more commonly used form of unsupervised learning is association rule discovery
49. Association Rule Discovery
   - Aims to discover interesting correlations or other relationships in large databases
   - Finds rules of the form
     if A and B then C and D
   - Which attributes will be included in the relation is unknown

50. Mining Association Rules
   - Similar to classification rules
   - Can we use the same procedure?
     - Every attribute is treated the same
     - We would have to apply it to every possible expression on the right-hand side
     - Huge number of rules => infeasible
   - We only want rules with high coverage/support

51. Market Basket Analysis
   - Basket data: items purchased on a per-transaction basis (not cumulative, etc.)
     - How do you boost the sales of a given product?
     - What other products does discontinuing a product impact?
     - Which products should be shelved together?
   - Terminology (market basket analysis):
     - Item: an attribute/value pair
     - Item set: a combination of items with minimum coverage
52. How Many k-Item Sets Have Minimum Coverage?

53. Item Sets
   [Table: the item sets of the weather data]
54. From Sets to Rules
   - 3-item set with coverage 4: humidity = normal, windy = false, play = yes
   - Association rules and their accuracy:
     If humidity = normal and windy = false then play = yes            4/4
     If humidity = normal and play = yes then windy = false            4/6
     If windy = false and play = yes then humidity = normal            4/6
     If humidity = normal then windy = false and play = yes            4/7
     If windy = false then humidity = normal and play = yes            4/8
     If play = yes then humidity = normal and windy = false            4/9
     If - then humidity = normal and windy = false and play = yes      4/12
55. From Sets to Rules (continued)
   - 4-item set with coverage 2: temperature = cool, humidity = normal, windy = false, play = yes
   - Association rules and their accuracy:
     If temperature = cool and windy = false then humidity = normal and play = yes   2/2
     If temperature = cool and humidity = normal and windy = false then play = yes   2/2
     If temperature = cool and windy = false and play = yes then humidity = normal   2/2

56. Overall
   - With minimum coverage 2:
     - 12 1-item sets, 47 2-item sets, 39 3-item sets, 6 4-item sets
   - With minimum accuracy 100%:
     - 58 association rules
   - "Best" rules (coverage = 4, accuracy = 100%):
     If humidity = normal and windy = false then play = yes
     If temperature = cool then humidity = normal
     If outlook = overcast then play = yes
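Slides 54-55 enumerate rules mechanically: every split of an item set into antecedent and consequent gives one rule, with accuracy = coverage(whole set) / coverage(antecedent). A sketch using the coverage counts quoted on slide 54 (empty antecedents omitted):

```python
from itertools import combinations

# coverage counts for the weather data, as quoted on the slide
coverage = {
    frozenset({'humidity=normal'}): 7,
    frozenset({'windy=false'}): 8,
    frozenset({'play=yes'}): 9,
    frozenset({'humidity=normal', 'windy=false'}): 4,
    frozenset({'humidity=normal', 'play=yes'}): 6,
    frozenset({'windy=false', 'play=yes'}): 6,
    frozenset({'humidity=normal', 'windy=false', 'play=yes'}): 4,
}

def rules_from_itemset(itemset, coverage):
    """All antecedent => consequent splits of one item set, with accuracy."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):            # non-empty, proper antecedents
        for ante in combinations(sorted(items), r):
            ante = frozenset(ante)
            rules.append((ante, items - ante, coverage[items] / coverage[ante]))
    return rules

rules = rules_from_itemset({'humidity=normal', 'windy=false', 'play=yes'}, coverage)
for ante, cons, acc in rules:
    print(sorted(ante), '=>', sorted(cons), acc)
```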
57. Association Rule Mining
   - STEP 1: Find all item sets that meet minimum coverage
   - STEP 2: Find all rules that meet minimum accuracy
   - STEP 3: Prune

58. Generating Item Sets
   - How do we generate minimum-coverage item sets in a scalable manner?
     - The total number of item sets is huge
     - It grows exponentially in the number of attributes
   - We need an efficient algorithm:
     - Start by generating minimum-coverage 1-item sets
     - Use those to generate 2-item sets, etc.
   - Why do we only need to consider minimum-coverage 1-item sets?
59. Justification
   - Item set 1: {humidity = high}; Coverage(1) = number of times humidity is high
   - Item set 2: {windy = false}; Coverage(2) = number of times windy is false
   - Item set 3: {humidity = high, windy = false}; Coverage(3) = number of times humidity is high and windy is false
   - Coverage(3) ≤ Coverage(1) and Coverage(3) ≤ Coverage(2)
   - So if item sets 1 and 2 do not both meet minimum coverage, item set 3 cannot meet it either

60. Generating Item Sets
   - Start with all 3-item sets that meet minimum coverage:
     (A B C) (A B D) (A C D) (A C E)
   - Merge pairs to generate candidate 4-item sets, considering only sets that start with the same two attributes:
     (A B C D) (A C D E)
   - These are the only two 4-item sets that could possibly work; their coverage must still be checked
61. Algorithm for Generating Item Sets
   - Build up from 1-item sets so that we only consider item sets formed by merging two minimum-coverage sets
   - Only consider sets that have all but one item in common
   - Computational efficiency is further improved using hash tables
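The merge described on slides 60-61 (combine two k-item sets that share all but their last item) is a few lines; this reproduces the slide's 3-item-set example. The full Apriori algorithm would additionally prune candidates that have an infrequent k-subset before counting coverage:

```python
def join_step(freq_k):
    """Merge pairs of frequent k-item sets that agree on their first k-1
    (sorted) items, yielding candidate (k+1)-item sets to be checked."""
    items = sorted(tuple(sorted(s)) for s in freq_k)
    candidates = set()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if a[:-1] == b[:-1]:            # all but one item in common
                candidates.add(frozenset(a) | frozenset(b))
    return candidates

threes = [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'), ('A', 'C', 'E')]
print(join_step(threes))   # only ABCD and ACDE need their coverage checked
```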
62. Generating Rules
   - If the rule
     If windy = false and play = no then outlook = sunny and humidity = high
     meets minimum coverage and accuracy, then both single-consequent rules
     If windy = false and play = no then outlook = sunny
     If windy = false and play = no then humidity = high
     also meet minimum coverage and accuracy

63. How Many Rules?
   - We want to consider every possible subset of attributes as the consequent
   - With 4 attributes:
     - Four single-consequent rules
     - Six double-consequent rules
     - Two triple-consequent rules
     - Twelve possible rules for a single 4-item set!
   - Exponential explosion in the number of possible rules

64. Must We Check All?
   - If A and B then C and D
   - If A, B and C then D

65. Efficiency Improvement
   - A double-consequent rule can only be acceptable if both of its single-consequent rules are acceptable
   - Procedure:
     - Start with single-consequent rules
     - Build up double-consequent rules, etc.
       - generate candidate rules
       - check them for accuracy
   - In practice we need to check far fewer rules
66. Apriori Algorithm
   - This is a simplified description of the Apriori algorithm
   - Developed in the early 1990s; it is the most commonly used approach
   - New developments focus on
     - Generating item sets more efficiently
     - Generating rules from item sets more efficiently

67. Association Rule Discovery using Weka
   - Parameters to be specified in Apriori:
     - upperBoundMinSupport: start with this value of minimum support
     - delta: in each step, decrease the required minimum support by this value
     - lowerBoundMinSupport: the final minimum support
     - numRules: how many rules to generate
     - metricType: confidence, lift, leverage, or conviction
     - minMetric: the smallest acceptable metric value for a rule
   - Handles only nominal attributes

68. Difficulties
   - The Apriori algorithm improves performance by using candidate item sets
   - Still some problems...
     - Costly to generate large numbers of item sets
       - To generate a frequent pattern of size 100 we need more than 2^100 ≈ 10^30 candidates!
     - Requires repeated scans of the database to check candidates
       - Again, most problematic for long patterns

69. Solution?
   - Can candidate generation be avoided?
   - A newer approach:
     - Create a frequent pattern tree (FP-tree)
       - stores information on frequent patterns
     - Use the FP-tree for mining frequent patterns
       - partitioning-based
       - divide-and-conquer
       - (as opposed to bottom-up generation)
70. Database => FP-Tree (minimum support = 3)
   [Figure: an FP-tree. The header table lists the items F, C, A, B, M, P, each with its head of node links; the tree nodes under the root carry the counts F:4, C:3, A:3, M:2, P:2, B:1, M:1, B:1, B:1, C:1, P:1]

71. Computational Effort
   - Each node has three fields:
     - item name
     - count
     - node link
   - There is also a header table with
     - item name
     - head of node link
   - Only two scans of the database are needed:
     - Collect the set of frequent items
     - Construct the FP-tree

72. Comments
   - The FP-tree is a compact data structure
   - The FP-tree contains all the information related to mining frequent patterns (given the support)
   - The size of the tree is bounded by the occurrences of frequent items
   - The height of the tree is bounded by the maximum number of items in a transaction
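The two-scan construction can be sketched compactly. The slide's underlying transaction database is not shown, so the five transactions below are a stand-in (the ones commonly used to illustrate FP-trees), and the resulting counts differ from the slide's figure:

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}          # item -> child Node

def build_fp_tree(transactions, min_support):
    """Scan 1: count items and keep the frequent ones.  Scan 2: insert each
    transaction in frequency order, sharing common prefixes, and keep a
    header table of node links per item."""
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = Node(None, None)
    header = defaultdict(list)      # item -> list of nodes (the node links)
    for t in transactions:
        node = root
        for i in sorted((i for i in t if i in frequent),
                        key=lambda i: (-counts[i], i)):
            child = node.children.get(i)
            if child is None:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)
            else:
                child.count += 1
            node = child
    return root, header

tx = [set('FCAMP'), set('FCABM'), set('FB'), set('CBP'), set('FCAMP')]
root, header = build_fp_tree(tx, min_support=3)
```

Following header['P'] and walking each node's parent links recovers exactly the prefix paths used in the mining step on the next slides.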
73. Mining Patterns
   - Mine the complete set of frequent patterns
   - For any frequent item A, all possible patterns containing A can be obtained by following A's node links, starting from A's head of node links in the header table

74. Example
   [Figure: following the node links for P in the tree yields the paths <F:4, C:3, A:3, M:2, P:2> (where P occurs twice) and <C:1, B:1, P:1> (where P occurs once), giving the frequent pattern (P:3)]
75. Rule Generation
   - Mining the complete set of association rules has some problems
     - There may be a large number of frequent item sets
     - There may be a huge number of association rules
   - One potential solution is to look at closed item sets only

76. Frequent Closed Item Sets
   - An item set X is a closed item set if there is no item set X' such that X ⊂ X' and every transaction containing X also contains X'
   - A rule X => Y is an association rule on a frequent closed item set if
     - both X and X ∪ Y are frequent closed item sets, and
     - there does not exist a frequent closed item set Z such that X ⊂ Z ⊂ X ∪ Y
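The definition translates directly into a check: X is closed exactly when no single extra item is present in every transaction that contains X. A sketch over an invented three-transaction database:

```python
def coverage(itemset, transactions):
    """Number of transactions containing every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset, transactions):
    """X is closed iff no proper superset of X has the same coverage."""
    cov = coverage(itemset, transactions)
    others = set().union(*transactions) - itemset
    return cov > 0 and all(coverage(itemset | {x}, transactions) < cov
                           for x in others)

tx = [frozenset('ABC'), frozenset('AB'), frozenset('AC')]
print(is_closed(frozenset('A'), tx))   # True
print(is_closed(frozenset('B'), tx))   # False: every transaction with B also has A
```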
77. Example
   - Frequent item sets (minimum support = 2):
     A (3), E (4), AE (2), ACDF (2), CF (3), CEF (3), D (2), AC (2), plus 12 more
   - Which of these are not closed, and why? [Figure highlights the closed sets]
78. Mining Frequent Closed Item Sets (CLOSET)
   [Figure: a CLOSET run on the transaction database TDB = {CEFAD, EA, CEF, CFAD, CEF}. Items are ordered by support (C:4, E:4, F:4, A:3, D:2) and the algorithm recursively builds conditional databases (D-, A-, F-, E-, and EA-conditional), outputting the frequent closed item sets CFAD:2, A:3, CF:2, CEF:3, E:4, and EA:2]
79. Mining with Taxonomies
   - Taxonomy: Clothes (Outerwear (Jackets, Ski Pants), Shirts); Footwear (Shoes, Hiking Boots)
   - A generalized association rule is a rule X => Y where no item in Y is an ancestor of an item in X

80. Why Taxonomy?
   - 'Classic' association rule mining restricts the rules to the leaf nodes of the taxonomy
   - However:
     - Rules at lower levels may not have minimum support, and thus interesting associations may go undiscovered
     - Taxonomies can be used to prune uninteresting and redundant rules
81. Example
82. Interesting Rules
   - There are many ways in which the interestingness of a rule can be evaluated based on its ancestors
   - For example:
     - A rule with no ancestors is interesting
     - A rule with ancestor(s) is interesting only if it has enough 'relative support'
   - Which rules are interesting?

83. Discussion
   - Association rule mining finds expressions of the form X => Y in large data sets
   - One of the most popular data mining tasks
   - It originates in market basket analysis
   - Key measures of performance:
     - Support
     - Confidence (or accuracy)
   - Are support and confidence enough?

84. Types of Rules Discovered
   - The 'classic' association rule problem:
     - All rules satisfying minimum thresholds of support and confidence
   - Focus on a subset of rules, e.g.,
     - Optimized rules
     - Maximal frequent item sets
     - Closed item sets
   - What makes for an interesting rule?

85. Algorithm Construction
   - Determine the frequent item sets (all or part)
     - By far the most computational time
     - Variations focus on this part
   - Generate rules from the frequent item sets
86. Generating Item Sets
   - Algorithms differ along two dimensions: how the search space is traversed (bottom-up vs top-down) and how support is determined (counting vs intersecting)
   - Counting: Apriori (discussed), AprioriTID, and DIC (the Apriori-like algorithms), and FP-Growth (discussed)
   - Intersecting: Partition and Eclat
   - No algorithm dominates the others!
87. Applications
   - Market basket analysis
     - The classic marketing application
   - Applications to recommender systems

88. Recommender
   - Customized goods and services
   - Recommend products
   - Collaborative filtering
     - exploits similarities among users' tastes
     - recommends based on other users
     - many on-line systems
     - simple algorithms

89. Classification Approach
   - View recommendation as a classification problem
     - A product is either of interest or not
     - Induce a model, e.g., a decision tree
     - Classify a new product as either interesting or not interesting
   - What is the difficulty in this approach?

90. Association Rule Approach
   - Product associations
     - 90% of users who like product A and product B also like product C
     - A and B => C (90%)
   - User associations
     - 90% of products liked by user A and user B are also liked by user C
   - Use a combination of product and user associations

91. Advantages
   - 'Classic' collaborative filtering must identify users with similar tastes
   - This approach uses the overlap of other users' tastes to match a given user's taste
     - It can be applied to users whose tastes don't correlate strongly with those of other users
     - It can take advantage of information from, say, user A for a recommendation to user B, even if they do not correlate

92. What's Different Here?
   - Is this really a 'classic' association rule problem?
   - We want to learn what products are liked by what users
   - 'Semi-supervised'
   - There is a target item:
     - a user (for user associations)
     - a product (for product associations)
93. Single-Consequent Rules
   - Only a single (target) item appears in the consequent
   - Go through all such items
   - Classification uses one single-item consequent; association rules allow all possible item combinations as consequent; associations for recommenders sit in between