Data Mining

  1. Data Mining (Rajendra Akerkar)
  2. What Is Data Mining?
- Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
- Is everything "data mining"? No: (deductive) query processing, expert systems, and small ML/statistical programs are not.
- The aim is to build computer programs that sift through databases automatically, seeking regularities or patterns.
July 7, 2009. Data Mining: R. Akerkar
  3. Data Mining — What's in a Name?
Also known as: Information Harvesting, Knowledge Mining, Knowledge Discovery in Databases, Data Dredging, Data Archaeology, Data Pattern Processing, Database Mining, Knowledge Extraction, Siftware.
The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern recognition technologies and statistical and mathematical techniques.
  4. Definition
Several definitions:
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
- Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
From [Fayyad] Advances in Knowledge Discovery and Data Mining, 1996.
  5. What is (not) Data Mining?
What is not data mining?
- Looking up a phone number in a phone directory.
- Querying a Web search engine for information about "Pune".
What is data mining?
- Discovering that certain names are more common in certain Indian states (Joshi, Patil, Kulkarni… in the Pune area).
- Grouping together similar documents returned by a search engine according to their context (e.g. Google Scholar).
  6. Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
- Traditional techniques may be unsuitable due to:
  - the enormity of the data,
  - the high dimensionality of the data,
  - the heterogeneous, distributed nature of the data.
  7. Data Mining Tasks
- Prediction methods: use some variables to predict unknown or future values of other variables.
- Description methods: find human-interpretable patterns that describe the data.
  8. Data Mining Tasks...
- Classification [Predictive]: predicting an item's class.
- Clustering [Descriptive]: finding clusters in data.
- Association Rule Discovery [Descriptive]: finding frequently occurring events.
- Deviation/Anomaly Detection [Predictive]: finding changes.
  9. Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
  10. Classification Example
Training set (attributes Refund, Marital Status, Taxable Income are categorical, categorical, and continuous; Cheat is the class):
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes
Test set (class unknown):
  Refund  Marital Status  Taxable Income  Cheat
  No      Single          75K             ?
  Yes     Married         50K             ?
  No      Married         150K            ?
  Yes     Divorced        90K             ?
  No      Single          40K             ?
  No      Married         80K             ?
A classifier model is learned from the training set and validated on the test set.
  11. Classification: Application 1
Direct Marketing
- Goal: reduce the cost of mailing by targeting the set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before.
  - We know which customers decided to buy and which decided otherwise; this {buy, don't buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
  - Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997.
  12. Classification: Application 2
Fraud Detection
- Goal: predict fraudulent cases in credit card transactions.
- Approach:
  - Use credit card transactions and the information on the account-holder as attributes (when does a customer buy, what does he buy, how often he pays on time, etc.).
  - Label past transactions as fraud or fair transactions; this forms the class attribute.
  - Learn a model for the class of the transactions.
  - Use this model to detect fraud by observing credit card transactions on an account.
  13. Classification: Application 3
Customer Attrition/Churn
- Goal: predict whether a customer is likely to be lost to a competitor.
- Approach:
  - Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.).
  - Label the customers as loyal or disloyal.
  - Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997.
  14. Decision Tree
  15. Introduction
- A classification scheme which generates a tree and a set of rules from a given data set.
- The set of records available for developing classification methods is divided into two disjoint subsets: a training set and a test set.
- The attributes of the records are categorised into two types:
  - Attributes whose domain is numerical are called numerical attributes.
  - Attributes whose domain is not numerical are called categorical attributes.
  16. Introduction
- A decision tree is a tree with the following properties:
  - An inner node represents an attribute.
  - An edge represents a test on the attribute of the parent node.
  - A leaf represents one of the classes.
- Construction of a decision tree:
  - Based on the training data.
  - Top-down strategy.
  17. Decision Tree Example
- The data set has five attributes.
- There is a special attribute: the attribute class is the class label.
- The attributes temp (temperature) and humidity are numerical attributes.
- The other attributes are categorical, that is, they cannot be ordered.
- Based on the training data set, we want to find a set of rules telling us what values of outlook, temperature, humidity and wind determine whether or not to play golf.
  18. Decision Tree Example
- We have five leaf nodes. In a decision tree, each leaf node represents a rule.
- The rules corresponding to the tree given in the figure are:
  - RULE 1: If it is sunny and the humidity is not above 75%, then play.
  - RULE 2: If it is sunny and the humidity is above 75%, then do not play.
  - RULE 3: If it is overcast, then play.
  - RULE 4: If it is rainy and not windy, then play.
  - RULE 5: If it is rainy and windy, then don't play.
  19. Classification
- The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node.
- A record enters the tree at the root node. At the root, a test is applied to determine which child node the record will encounter next; this process is repeated until the record arrives at a leaf node.
- All the records that end up at a given leaf of the tree are classified in the same way.
- There is a unique path from the root to each leaf; the path is a rule which is used to classify the records.
  20. (contd.)
- In our tree, we can carry out the classification of an unknown record as follows.
- Assume that, for the record, we know the values of the first four attributes (but not the value of the class attribute): outlook = rain; temp = 70; humidity = 65; windy = true.
  21. (contd.)
- We start from the root node and check the value of the attribute associated with it. For a decision tree, every node has an associated attribute, called the splitting attribute at that node.
- In our example, outlook is the splitting attribute at the root. Since for the given record outlook = rain, we move to the right-most child node of the root.
- At this node the splitting attribute is windy, and for the record we want to classify, windy = true. Hence we move to the left child node and conclude that the class label is "no play".
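The traversal above can be sketched as a small Python function encoding Rules 1-5 (an illustration of this particular example tree only; "no play" stands in for "do not play"):

```python
def classify(outlook, temp, humidity, windy):
    """Traverse the example decision tree (Rules 1-5) for one record.

    temp is accepted for completeness but, as in the example tree,
    temperature is never tested on any root-to-leaf path.
    """
    if outlook == "sunny":
        # Splitting attribute here: humidity (Rules 1 and 2)
        return "play" if humidity <= 75 else "no play"
    if outlook == "overcast":
        return "play"  # Rule 3
    # outlook == "rain": splitting attribute is windy (Rules 4 and 5)
    return "no play" if windy else "play"

# The unknown record from the walkthrough:
print(classify("rain", 70, 65, True))  # no play
```

For the record outlook = rain, windy = true, the function follows the same root-to-leaf path as the manual traversal.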
  22. (contd.)
- The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified.
- For Rule 1 (if it is sunny and the humidity is not above 75%, then play), there are two records of the test data set satisfying outlook = sunny and humidity < 75, and only one of these is correctly classified as play. Thus the accuracy of this rule is 0.5 (50%).
- Similarly, the accuracy of Rule 2 is also 0.5 (50%), and the accuracy of Rule 3 is 0.66.
  23. Concept of Categorical Attributes
- Consider the following training data set with three attributes: age, pincode and class.
- The attribute class is used as the class label.
- The attribute age is numeric, whereas pincode is categorical: though the domain of pincode is numeric, no ordering can be defined among pincode values. You cannot derive any useful information from the fact that one pincode is greater than another.
  24. (contd.)
- The figure gives a decision tree for the training data.
- The splitting attribute at the root is pincode, and the splitting criterion here is pincode = 500 046.
- At root level we have 9 records. As a result of this split, records 1, 2, 4, 8 and 9 go to the left child node and the remaining records to the right node.
- For the left child node, the splitting attribute is age and the splitting criterion is age < 48.
- Although the right child node has the same splitting attribute (age), its splitting criterion is different.
- The process is repeated at every node.
  25. Advantages and Shortcomings of Decision Tree Classifications
- A decision tree construction process is concerned with identifying the splitting attribute and splitting criterion at every level of the tree.
- Major strengths:
  - Decision trees are able to generate understandable rules.
  - They are able to handle both numerical and categorical attributes.
  - They provide a clear indication of which fields are most important for prediction or classification.
- Weaknesses:
  - The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field is examined before its best split can be found.
  - Some decision trees can only deal with binary-valued target classes.
  26. Iterative Dichotomizer (ID3)
- Quinlan (1986).
- Each node corresponds to a splitting attribute; each arc is a possible value of that attribute.
- At each node the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root.
- Entropy is used to measure how informative a node is.
- The algorithm uses the criterion of information gain to determine the goodness of a split: the attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.
  27. Training Dataset (follows an example from Quinlan's ID3)
The class label attribute, buys_computer, has two distinct values, so there are two distinct classes (m = 2): class C1 corresponds to yes and class C2 to no. There are 9 samples of class yes and 5 samples of class no.
  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31…40   high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31…40   low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31…40   medium  no       excellent      yes
  31…40   high    yes      fair           yes
  >40     medium  no       excellent      no
  28. Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction; the leaf node holds the class prediction.
- Rules are easier for humans to understand.
- What are the rules?
  29. Solution (Rules)
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
  30. Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm):
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner.
  - At the start, all the training examples are at the root.
  - Attributes are categorical (if continuous-valued, they are discretized in advance).
  - Examples are partitioned recursively based on selected attributes.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g. information gain).
- Conditions for stopping the partitioning:
  - All samples for a given node belong to the same class.
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
  - There are no samples left.
  31. Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let S contain s_i tuples of class C_i for i = 1, ..., m.
- The information (encoded in bits) required to classify an arbitrary tuple is
  I(s_1, s_2, ..., s_m) = - sum_{i=1..m} (s_i / s) log2(s_i / s)
- The entropy of attribute A with values {a_1, a_2, ..., a_v} is
  E(A) = sum_{j=1..v} ((s_1j + ... + s_mj) / s) I(s_1j, ..., s_mj)
- The information gained by branching on attribute A is
  Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
  32. Entropy
- Entropy measures the homogeneity (purity) of a set of examples: it gives the information content of the set in terms of the class labels of the examples.
- Consider a set of examples S with two classes, P and N. Let the set have p instances of class P and n instances of class N, so the total number of instances is t = p + n. The pair [p, n] can be seen as the class distribution of S.
- The entropy of S is defined as
  Entropy(S) = -(p/t) log2(p/t) - (n/t) log2(n/t)
- Example: let a set of examples consist of 9 instances of the positive class and 5 of the negative class, so p = 9 and n = 5. Then
  Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14)
             = -(0.64286)(-0.6375) - (0.35714)(-1.48557)
             = 0.40982 + 0.53056
             = 0.940
  33. Entropy
The entropy of a completely pure set is 0, and it is 1 for a set with equal occurrences of both classes:
  Entropy[14, 0] = -(14/14) log2(14/14) - (0/14) log2(0/14) = -1 x 0 - 0 = 0
  Entropy[7, 7]  = -(7/14) log2(7/14) - (7/14) log2(7/14) = -(0.5)(-1) - (0.5)(-1) = 0.5 + 0.5 = 1
  34. Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes"; class N: buys_computer = "no".
- I(p, n) = I(9, 5) = 0.940.
- Compute the entropy for age:
  age     p_i  n_i  I(p_i, n_i)
  <=30    2    3    0.971
  31…40   4    0    0
  >40     3    2    0.971
  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  Here (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's.
- Hence Gain(age) = I(p, n) - E(age) = 0.246.
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
- Since age has the highest information gain among the attributes, it is selected as the test attribute.
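These gains can be re-derived mechanically. A minimal Python sketch (the helper names `info` and `gain` are mine, not from the slides) recomputes I(9, 5) and the four gains from the buys_computer table:

```python
from math import log2

def info(*counts):
    """I(s1, ..., sm): bits needed to classify a tuple, given class counts."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

# The buys_computer training set from the slides:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def gain(attr_index):
    """Gain(A) = I(s1, ..., sm) - E(A) for the attribute in that column."""
    total = info(sum(r[-1] == "yes" for r in data),
                 sum(r[-1] == "no" for r in data))
    e = 0.0
    for v in {r[attr_index] for r in data}:
        subset = [r for r in data if r[attr_index] == v]
        e += len(subset) / len(data) * info(
            sum(r[-1] == "yes" for r in subset),
            sum(r[-1] == "no" for r in subset))
    return total - e

for name, i in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
    print(f"Gain({name}) = {gain(i):.3f}")
```

The computed values match the slide up to rounding (e.g. Gain(age) = 0.247 against the slide's truncated 0.246), and age indeed has the largest gain.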
  35. Exercise 1
- The following table consists of training data from an employee database.
- Let status be the class attribute. Use the ID3 algorithm to construct a decision tree from the given data.
  36. Solution 1
  37. Other Attribute Selection Measures
- Gini index (CART, IBM IntelligentMiner):
  - All attributes are assumed continuous-valued.
  - Assume there exist several possible split values for each attribute.
  - May need other tools, such as clustering, to get the possible split values.
  - Can be modified for categorical attributes.
  38. Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - sum_{j=1..n} p_j^2
  where p_j is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
  39. Exercise 2
  40. Solution 2
SPLIT: Age <= 50
             | High | Low | Total
  S1 (left)  |  8   | 11  | 19
  S2 (right) | 11   | 10  | 21
For S1: P(high) = 8/19 = 0.42 and P(low) = 11/19 = 0.58
For S2: P(high) = 11/21 = 0.52 and P(low) = 10/21 = 0.48
Gini(S1) = 1 - [0.42x0.42 + 0.58x0.58] = 1 - [0.18 + 0.34] = 0.48
Gini(S2) = 1 - [0.52x0.52 + 0.48x0.48] = 1 - [0.27 + 0.23] = 0.5
Gini-Split(Age<=50) = 19/40 x 0.48 + 21/40 x 0.5 = 0.23 + 0.26 = 0.49

SPLIT: Salary <= 65K
              | High | Low | Total
  S1 (top)    | 18   |  5  | 23
  S2 (bottom) |  1   | 16  | 17
For S1: P(high) = 18/23 = 0.78 and P(low) = 5/23 = 0.22
For S2: P(high) = 1/17 = 0.06 and P(low) = 16/17 = 0.94
Gini(S1) = 1 - [0.78x0.78 + 0.22x0.22] = 1 - [0.61 + 0.05] = 0.34
Gini(S2) = 1 - [0.06x0.06 + 0.94x0.94] = 1 - [0.004 + 0.884] = 0.11
Gini-Split(Salary<=65K) = 23/40 x 0.34 + 17/40 x 0.11 = 0.20 + 0.05 = 0.25
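The same numbers can be checked in a few lines of Python (the function names `gini` and `gini_split` are mine; exact arithmetic gives about 0.493 and 0.243, which the slide's rounded intermediate probabilities turn into 0.49 and 0.25):

```python
def gini(counts):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Size-weighted gini of a candidate split, given per-subset class counts."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# (high, low) counts from the exercise: Age <= 50 vs Salary <= 65K
age_split = [(8, 11), (11, 10)]      # S1 (left), S2 (right)
salary_split = [(18, 5), (1, 16)]    # S1 (top), S2 (bottom)

print(round(gini_split(age_split), 2))     # 0.49
print(round(gini_split(salary_split), 2))  # 0.24
```

Either way, the Salary split has the smaller gini_split and would be preferred.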
  41. Exercise 3
- In the previous exercise, which of the two split points gives the better split of the data? Why?
  42. Solution 3
- Intuitively, Salary <= 65K is the better split point since it produces relatively "pure" partitions, as opposed to Age <= 50, which results in more mixed partitions (just look at the distribution of Highs and Lows in S1 and S2).
- More formally, consider the properties of the Gini index. If a partition is totally pure, i.e. has all elements from the same class, then gini(S) = 1 - [1x1 + 0x0] = 0 (for two classes). On the other hand, if the classes are totally mixed, i.e. both classes have equal probability, then gini(S) = 1 - [0.5x0.5 + 0.5x0.5] = 1 - [0.25 + 0.25] = 0.5.
- In other words, the closer the gini value is to 0, the better the partition. Since Salary has the lower gini, it is the better split.
  43. Clustering
  44. Clustering: Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - data points in one cluster are more similar to one another;
  - data points in separate clusters are less similar to one another.
- Similarity measures:
  - Euclidean distance if the attributes are continuous;
  - other problem-specific measures.
  45. Clustering: Illustration
(Figure: Euclidean-distance-based clustering in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.)
  46. Clustering: Application 1
Market Segmentation
- Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing the buying patterns of customers in the same cluster versus those from different clusters.
  47. Clustering: Application 2
Document Clustering
- Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: identify frequently occurring terms in each document; form a similarity measure based on the frequencies of the different terms; use it to cluster.
- Gain: information retrieval can utilize the clusters to relate a new document or search term to the clustered documents.
  48. k-Means
  49. Clustering
- Clustering is the process of grouping data into clusters so that objects within a cluster are similar to one another, but very dissimilar to objects in other clusters.
- The similarities are assessed based on the attribute values describing these objects.
  50. The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e. mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no new assignments.
  51. The K-Means Clustering Method: Example
(Figure: with K = 2, arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until no object changes cluster.)
  52. K-Means Clustering
- K-means is a partition-based clustering algorithm.
- K-means' goal: partition database D into K parts, where there is little similarity across groups, but great similarity within a group. More specifically, k-means aims to minimize the mean square error of each point in a cluster, with respect to its cluster centroid.
  53. K-Means Example
- Consider the following one-dimensional database with attribute A1: {2, 4, 10, 12, 3, 20, 30, 11, 25}.
- Let us use the k-means algorithm to partition this database into k = 2 clusters. We begin by choosing two random starting points, which will serve as the centroids of the two clusters: μC1 = 2 and μC2 = 4.
  54. (contd.)
- To form clusters, we assign each point in the database to the nearest centroid. For instance, 10 is closer to c2 than to c1.
- If a point is the same distance from two centroids, such as point 3 in our example, we make an arbitrary assignment.
- The resulting assignment: C1 = {2, 3}; C2 = {4, 10, 12, 20, 30, 11, 25}.
  55. (contd.)
Once all points have been assigned, we recompute the means of the clusters:
  μC1 = (2 + 3)/2 = 2.5
  μC2 = (4 + 10 + 12 + 20 + 30 + 11 + 25)/7 = 112/7 = 16
  56. (contd.)
- We then reassign each point to the two clusters based on the new means. Note that point 4 now belongs to cluster C1: C1 = {2, 4, 3}; C2 = {10, 12, 20, 30, 11, 25}.
- The steps are repeated until the means converge to their optimal values. In each iteration, the means are recomputed and all points are reassigned.
  57. (contd.)
We compute the new means:
  μC1 = (2 + 3 + 4)/3 = 3
  μC2 = (10 + 12 + 20 + 30 + 11 + 25)/6 = 108/6 = 18
Reassigning with these means, point 10 (distance 7 to μC1 versus 8 to μC2) in fact moves to C1, so the iterations continue. Repeating the assign/update steps, points 11 and 12 also migrate, and the clusters stabilize as C1 = {2, 3, 4, 10, 11, 12} (mean 7) and C2 = {20, 25, 30} (mean 25). At that point reassigning the points changes nothing: the means have converged and the algorithm terminates.
  58. Visualization of the k-means algorithm
  59. Exercise
- Apply the k-means algorithm to the following 1-dimensional points (for k = 2): 1, 2, 3, 4, 6, 7, 8, 9.
- Use 1 and 2 as the starting centroids.
  60. Solution
Iteration #1: cluster at 1: {1}, mean = 1; cluster at 2: {2, 3, 4, 6, 7, 8, 9}, mean = 5.57
Iteration #2: cluster at 1: {1, 2, 3}, mean = 2; cluster at 5.57: {4, 6, 7, 8, 9}, mean = 6.8
Iteration #3: cluster at 2: {1, 2, 3, 4}, mean = 2.5; cluster at 6.8: {6, 7, 8, 9}, mean = 7.5
Iteration #4: cluster at 2.5: {1, 2, 3, 4}, mean = 2.5; cluster at 7.5: {6, 7, 8, 9}, mean = 7.5
The means haven't changed, so we stop iterating. The final clusters are {1, 2, 3, 4} and {6, 7, 8, 9}.
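The exercise can be verified with a short, generic 1-D k-means sketch (the function name `kmeans_1d` is mine; it assumes no cluster ever becomes empty, which holds for this data):

```python
def kmeans_1d(points, centroids):
    """Plain 1-D k-means: assign each point to the nearest centroid,
    recompute the means, and repeat until the means stop changing."""
    centroids = list(centroids)
    while True:
        clusters = [[] for _ in centroids]
        for p in points:
            # nearest centroid; ties go to the lower-indexed cluster
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        new = [sum(c) / len(c) for c in clusters]
        if new == centroids:
            return clusters, centroids
        centroids = new

clusters, means = kmeans_1d([1, 2, 3, 4, 6, 7, 8, 9], [1, 2])
print(clusters[0], means[0])  # [1, 2, 3, 4] 2.5
print(clusters[1], means[1])  # [6, 7, 8, 9] 7.5
```

The run passes through the same intermediate means (5.57, then 2 and 6.8, then 2.5 and 7.5) as the hand calculation.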
  61. K-Means for a 2-dimensional database
- Let us consider {x1, x2, x3, x4, x5} with the following coordinates as a two-dimensional sample for clustering: x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2).
- Suppose that the required number of clusters is 2.
- Initially, clusters are formed from a random distribution of the samples: C1 = {x1, x2, x4} and C2 = {x3, x5}.
  62. Centroid Calculation
- Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, ..., CK}. Each Ck has nk samples and each sample is in exactly one cluster, so sum_k nk = N for k = 1, ..., K.
- The mean vector Mk of cluster Ck is defined as the centroid of the cluster:
  Mk = (1/nk) sum_{i=1..nk} x_ik
  where x_ik is the i-th sample belonging to cluster Ck.
- In our example, the centroids of the two clusters are
  M1 = ((0 + 0 + 5)/3, (2 + 0 + 0)/3) = (1.66, 0.66)
  M2 = ((1.5 + 5)/2, (0 + 2)/2) = (3.25, 1.00)
  63. The Square-error of the Cluster
- The square-error for cluster Ck is the sum of squared Euclidean distances between each sample in Ck and its centroid; this error is called the within-cluster variation:
  ek^2 = sum_{i=1..nk} (x_ik - Mk)^2
- The within-cluster variations after the initial random distribution of samples are
  e1^2 = [(0 - 1.66)^2 + (2 - 0.66)^2] + [(0 - 1.66)^2 + (0 - 0.66)^2] + [(5 - 1.66)^2 + (0 - 0.66)^2] = 19.36
  e2^2 = [(1.5 - 3.25)^2 + (0 - 1)^2] + [(5 - 3.25)^2 + (2 - 1)^2] = 8.12
  64. Total Square-error
The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:
  E^2 = sum_{k=1..K} ek^2
The total square error is E^2 = e1^2 + e2^2 = 19.36 + 8.12 = 27.48.
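A small Python sketch (the helper names `centroid` and `within_cluster_variation` are mine) reproduces these quantities; note that exact arithmetic gives 19.33 and 27.46, while the slides' 19.36 and 27.48 come from using the rounded centroid (1.66, 0.66):

```python
def centroid(cluster):
    """Mean vector (centroid) of a list of 2-D points."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

def within_cluster_variation(cluster):
    """e_k^2: sum of squared Euclidean distances to the cluster centroid."""
    mx, my = centroid(cluster)
    return sum((x - mx) ** 2 + (y - my) ** 2 for x, y in cluster)

C1 = [(0, 2), (0, 0), (5, 0)]   # x1, x2, x4
C2 = [(1.5, 0), (5, 2)]         # x3, x5

e1 = within_cluster_variation(C1)
e2 = within_cluster_variation(C2)
print(round(e1, 2), round(e2, 2), round(e1 + e2, 2))  # 19.33 8.12 27.46
```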
  65. (contd.)
When we reassign all samples according to the minimum distance from the centroids M1 and M2, the new distribution of samples among the clusters is:
  d(M1, x1) = (1.66^2 + 1.34^2)^(1/2) = 2.14 and d(M2, x1) = 3.40, so x1 ∈ C1
  d(M1, x2) = 1.79 and d(M2, x2) = 3.40, so x2 ∈ C1
  d(M1, x3) = 0.83 and d(M2, x3) = 2.01, so x3 ∈ C1
  d(M1, x4) = 3.41 and d(M2, x4) = 2.01, so x4 ∈ C2
  d(M1, x5) = 3.60 and d(M2, x5) = 2.01, so x5 ∈ C2
These calculations use the Euclidean distance formula
  d(xi, xj) = (sum_{k=1..m} (x_ik - x_jk)^2)^(1/2)
  66. (contd.)
The new clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids
  M1 = (0.5, 0.67)
  M2 = (5.0, 1.0)
The corresponding within-cluster variations and total square error are
  e1^2 = 4.17, e2^2 = 2.00, E^2 = 6.17
  67. The cluster membership stabilizes
- After the first iteration, the total square error is significantly reduced (from 27.48 to 6.17).
- In this example, if we analyse the distances between the new centroids and the samples, the second iteration assigns the samples to the same clusters.
- Thus there is no further reassignment and the algorithm halts.
  68. Variations of the K-Means Method
- The variants of k-means differ in:
  - the selection of the initial k means;
  - the strategies used to calculate cluster means.
- Handling categorical data: k-modes (Huang '98):
  - Replacing means of clusters with modes.
  - Using new dissimilarity measures to deal with categorical objects.
  - Using a frequency-based method to update modes of clusters.
  - For a mixture of categorical and numerical data: the k-prototype method.
  69. What is the problem with the k-Means Method?
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
- K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
  70. Exercise 2
- Let the set X consist of the following sample points in 2-dimensional space: X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5, -1), (0, 1.6), (-1, 1.5)}.
- Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of centroids for X.
- What are the revised values of c1 and c2 after 1 iteration of k-means clustering (k = 2)?
  71. Solution 2
For each data point, calculate the distance to each centroid:
        x     y      d(xi, c1)   d(xi, c2)
  x1    1     2      0.707107    2.236068
  x2    1.5   2.2    0.3         1.920937
  x3    3     2.3    1.513275    1.3
  x4    2.5   -1     3.640055    2.061553
  x5    0     1.6    1.749286    3.059412
  x6    -1    1.5    2.692582    4.031129
  72. (contd.)
It follows that x1, x2, x5 and x6 are closer to c1 and the other points are closer to c2. Hence we replace c1 with the average of x1, x2, x5 and x6, and c2 with the average of x3 and x4. This gives:
  c1' = (0.375, 1.825)
  c2' = (2.75, 0.65)
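One assignment-and-update step, as required by the exercise, can be sketched as follows (the function name `one_kmeans_iteration` is mine; it assumes no cluster ends up empty):

```python
from math import dist  # Euclidean distance, Python 3.8+

def one_kmeans_iteration(points, centroids):
    """One assign/update step of k-means: group each point with its
    nearest centroid, then return the new centroids."""
    clusters = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        clusters[i].append(p)
    # per-coordinate mean of each cluster
    return [tuple(sum(coord) / len(cl) for coord in zip(*cl)) for cl in clusters]

X = [(1, 2), (1.5, 2.2), (3, 2.3), (2.5, -1), (0, 1.6), (-1, 1.5)]
c1, c2 = one_kmeans_iteration(X, [(1.5, 2.5), (3, 1)])
print(tuple(round(v, 3) for v in c1))  # (0.375, 1.825)
print(tuple(round(v, 3) for v in c2))  # (2.75, 0.65)
```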
  73. 73. Association Rule DiscoveryJuly 7, 2009 Data Mining: R. Akerkar 73
  74. 74. Market-basket problem.n We are given a set of items and a large collection of transactions, which are subsets (baskets) of these items.n Task: To find relationships between the presences of various items within these baskets.n Example: To analyze customers buying habits by finding associations between the different items that customers place in their shopping baskets.July 7, 2009 Data Mining: R. Akerkar 74
75. 75. Associations discovery
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  - Associations discovery uncovers affinities among collections of items.
  - Affinities are represented by association rules.
  - Associations discovery is an unsupervised approach to data mining.
76. 76. Association Rule: Application 2
- Supermarket shelf management
  - Goal: identify items that are bought together by sufficiently many customers.
  - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
    - So don't be surprised if you find six-packs stacked next to diapers!
77. 77. Association Rule: Application 3
- Inventory management
  - Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
  - Approach: process the data on the tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
78. 78. What is a rule?
- A rule in a rule induction system comes in the form "If this and this and this, then this".
- For a rule to be useful, two pieces of information are needed:
  1. Accuracy (the lower the accuracy, the closer the rule comes to random guessing)
  2. Coverage (how often the rule can be applied)
- A rule consists of two parts:
  1. The antecedent, or LHS
  2. The consequent, or RHS
79. 79. An example

  Rule                                                          Accuracy   Coverage
  If breakfast cereal purchased, then milk will be purchased      85%        20%
  If bread purchased, then Swiss cheese will be purchased         15%        6%
  If 42 years old and purchased dry roasted peanuts,              95%        0.01%
  then beer purchased
80. 80. What is association rule mining?
81. 81. Frequent Itemset
82. 82. Support and Confidence
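The two definitions named on this slide can be sketched in Python: the support of an itemset is the fraction of transactions containing every item in it, and the confidence of a rule LHS → RHS is the fraction of transactions containing the LHS that also contain the RHS. The baskets below are hypothetical, chosen for illustration; they are not the table used later in the deck.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing the LHS that also contain the RHS."""
    both = set(lhs) | set(rhs)
    lhs = set(lhs)
    return (sum(both <= set(t) for t in transactions)
            / sum(lhs <= set(t) for t in transactions))

# Hypothetical market baskets (illustrative only)
T = [{"bread", "milk"},
     {"bread", "diaper", "beer"},
     {"milk", "diaper", "beer"},
     {"bread", "milk", "diaper", "beer"},
     {"bread", "milk", "diaper"}]
print(support({"diaper", "beer"}, T))       # 0.6  (3 of the 5 baskets)
print(confidence({"diaper"}, {"beer"}, T))  # 0.75 (3 of the 4 diaper baskets)
```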
83. 83. What to do with a rule?
- Target the antecedent
- Target the consequent
- Target based on accuracy
- Target based on coverage
- Target based on "interestingness"
- The antecedent can be one or more conditions, all of which must be true in order for the consequent to be true at the given accuracy.
- Generally the consequent is just a simple condition (e.g. purchasing one grocery store item) rather than multiple items.
84. 84.
- All rules that have a certain value for the antecedent are gathered and presented to the user.
  - For example, the grocery store may request all rules that have nails, bolts or screws in the antecedent, and try to conclude whether discontinuing sales of these lower-priced items will have any effect on higher-margin items like hammers.
- All rules that have a certain value for the consequent are gathered. These can be used to understand what affects the consequent.
  - For instance, it might be useful to know which rules have "coffee" in their RHS. A store owner might want to put coffee close to other items in order to increase sales of both, or a manufacturer may decide in which magazine to place its next coupons.
85. 85.
- Sometimes accuracy is most important. Highly accurate rules (correct 80 or 90% of the time) imply strong relationships even if the coverage is very low.
  - For example, say a rule can be applied only one time in a thousand, but that one application is very profitable; the rule can still be worthwhile. This is how most successful data mining applications in the financial markets work: looking for the limited occasions on which a very confident prediction can be made.
- Sometimes users want to know the rules that are most widely applicable. By looking at rules ranked by coverage, they can get a high-level view of what is happening in the database most of the time.
- Rules are interesting when they have high coverage and high accuracy but deviate from the norm. Eventually a tradeoff between coverage and accuracy can be made using a measure of interestingness.
86. 86. Evaluating and using rules
- Looking at simple statistics
- Using conjunctions and disjunctions
- Defining "interestingness"
- Other heuristics
87. 87. Using conjunctions and disjunctions
- These dramatically increase or decrease the coverage. For example:
  - "If diet soda or regular soda or beer, then potato chips" covers a lot more shopping baskets than any one of the constraints by itself.
88. 88. Defining "interestingness"
- Interestingness must have 4 basic behaviors:
  1. Interestingness = 0 when the rule's accuracy equals the background (a priori) probability; such rules are discarded.
  2. Interestingness increases as accuracy increases, if coverage is fixed.
  3. Interestingness increases or decreases with coverage, if accuracy stays fixed.
  4. Interestingness decreases with coverage for a fixed number of correct responses.
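The first two behaviors can be illustrated with a tiny sketch. The difference-from-chance score below is an assumption chosen for illustration, not a measure defined in the slides, and it deliberately ignores the coverage-dependent behaviors 3 and 4:

```python
def interestingness(rule_accuracy, background_prob):
    """Difference between rule accuracy and the background probability.

    Zero when the rule does no better than chance (behavior 1), and it
    grows as accuracy grows for a fixed coverage (behavior 2).
    """
    return rule_accuracy - background_prob

# A rule whose accuracy merely matches the 40% background rate scores 0;
# an 85%-accurate rule against the same background scores positively.
print(interestingness(0.40, 0.40))
print(interestingness(0.85, 0.40))
```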
89. 89. Other heuristics
- Look at the actual number of records covered, not just a probability or a percentage.
- Compare a given pattern to random chance. This gives an "out of the ordinary" measure.
- Keep it simple.
90. 90. Example
[Table: six transactions T over items including C, DM and CO]
- Here t supports the items C, DM, and CO. The item DM is supported by 4 out of 6 transactions in T. Thus, the support of DM is 4/6, i.e. about 66.7%.
91. 91. Definition
92. 92. Association Rules
- Algorithms that obtain association rules from data usually divide the task into two parts:
  - find the frequent itemsets, and
  - form the rules from them.
93. 93. Association Rules
- The problem of mining association rules can be divided into two subproblems:
94. 94. Definitions
95. 95. Apriori algorithm
- Agrawal and Srikant, 1994.
- It is also called the level-wise algorithm.
  - It is the most widely accepted algorithm for finding all the frequent sets.
  - It makes use of the downward closure property.
  - The algorithm is a bottom-up search, progressing upward level-wise in the lattice.
- The interesting fact:
  - before reading the database at every level, it prunes many of the sets which are unlikely to be frequent.
96. 96. Apriori algorithm
97. 97. Apriori candidate-generation method
98. 98. Pruning algorithm
99. 99. Apriori algorithm
100. 100. Exercise 3
Suppose that L3 is the list
{{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}}
Which itemsets are placed in C4 by the join step of the Apriori algorithm? Which are then removed by the prune step?
101. 101. Solution 3
- At the join step of the Apriori algorithm, each member (set) is compared with every other member.
- If all the elements of the two members are identical except the right-most ones, the union of the two sets is placed into C4.
- For the members of L3 given, the following sets of four elements are placed into C4: {a,b,c,d}, {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,s}, {p,q,r,t} and {p,q,s,t}.
102. 102. Solution 3 (continued)
- At the prune step of the algorithm, each member of C4 is checked to see whether all its subsets of 3 elements are members of L3.
- The result in this case is as follows:
[Table: the 3-element subsets of each C4 candidate, checked for membership of L3]
103. 103. Solution 3 (continued)
- Therefore {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,t} and {p,q,s,t} are removed by the prune step,
- leaving C4 as {{a,b,c,d}, {p,q,r,s}}.
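The join and prune steps used in Solution 3 can be reproduced with a short Python sketch of Apriori's candidate-generation function (itemsets are kept as sorted tuples so that "identical except the right-most element" is easy to test):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Join then prune: candidate (k+1)-itemsets from frequent k-itemsets."""
    Lk = [tuple(sorted(s)) for s in Lk]
    k = len(Lk[0])
    frequent = set(Lk)
    # Join: merge two k-itemsets that agree on their first k-1 items
    candidates = set()
    for a in Lk:
        for b in Lk:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # Prune: drop any candidate with an infrequent k-element subset
    return sorted(c for c in candidates
                  if all(s in frequent for s in combinations(c, k)))

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("b","c","d"),
      ("b","c","w"), ("b","c","x"), ("p","q","r"), ("p","q","s"),
      ("p","q","t"), ("p","r","s"), ("q","r","s")]
print(apriori_gen(L3))  # [('a', 'b', 'c', 'd'), ('p', 'q', 'r', 's')]
```

Running it on the L3 of Exercise 3 yields exactly the pruned C4 from the solution.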
104. 104. Exercise 4
- Given a dataset with four attributes w, x, y and z, each with three values, how many rules can be generated with one term on the right-hand side?
105. 105. Solution 4
- Let us assume that the attribute w has 3 values w1, w2, and w3, and similarly for x, y, and z.
- If we arbitrarily select attribute w to be on the right-hand side of each rule, there are 3 possible types of rule:
  - IF ... THEN w=w1
  - IF ... THEN w=w2
  - IF ... THEN w=w3
- Now choose one of these rules, say the first, and calculate how many possible left-hand sides there are for such rules.
106. 106. Solution 4 (continued)
- The number of "attribute=value" terms on the LHS can be 1, 2, or 3.
- Case I: one term on the LHS
  - There are 3 possible attributes: x, y, and z. Each has 3 possible values, so there are 3x3=9 possible LHS, e.g. IF x=x1.
107. 107. Solution 4 (continued)
- Case II: two terms on the LHS
  - There are 3 ways in which a combination of 2 attributes may appear on the LHS: x and y, y and z, and x and z.
  - Each attribute has 3 values, so for each pair there are 3x3=9 possible LHS, e.g. IF x=x1 AND y=y1.
  - There are 3 possible pairs of attributes, so the total number of possible LHS is 3x9=27.
108. 108. Solution 4 (continued)
- Case III: three terms on the LHS
  - All 3 attributes x, y and z must be on the LHS.
  - Each has 3 values, so there are 3x3x3=27 possible LHS, e.g. IF x=x1 AND y=y1 AND z=z1.
  - Thus for each of the 3 possible "w=value" terms on the RHS, the total number of LHS with 1, 2 or 3 terms is 9+27+27=63.
  - So there are 3x63 = 189 possible rules with attribute w on the RHS.
  - The attribute on the RHS could be any of four possibilities (not just w). Therefore the total possible number of rules is 4x189=756.
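The counting argument in Solution 4 can be verified by brute-force enumeration in Python (the attribute and value names w1...z3 are the hypothetical ones from the exercise):

```python
from itertools import combinations, product

attributes = {"w": ["w1", "w2", "w3"], "x": ["x1", "x2", "x3"],
              "y": ["y1", "y2", "y3"], "z": ["z1", "z2", "z3"]}

count = 0
for rhs_attr in attributes:                       # attribute on the RHS
    lhs_attrs = [a for a in attributes if a != rhs_attr]
    for rhs_val in attributes[rhs_attr]:          # its 3 possible values
        for n in (1, 2, 3):                       # 1, 2 or 3 LHS terms
            for chosen in combinations(lhs_attrs, n):
                for vals in product(*(attributes[a] for a in chosen)):
                    count += 1                    # one distinct rule

print(count)  # 756
```

The enumeration agrees with the closed-form count 4 x 3 x (9 + 27 + 27) = 756.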
109. 109. References
- R. Akerkar and P. Lingras. Building an Intelligent Web: Theory and Practice. Jones & Bartlett, 2008 (in India: Narosa Publishing House, 2009)
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001
- D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001