  1. Data Mining Algorithms. Prof. S. Sudarshan, CSE Dept, IIT Bombay. Most slides courtesy Prof. Sunita Sarawagi, School of IT, IIT Bombay.
  2. Overview
     - Decision tree classification algorithms
     - Clustering algorithms
     - Challenges
     - Resources
  3. Decision Tree Classifiers
  4. Decision tree classifiers
     - Widely used learning method
     - Easy to interpret: can be re-represented as if-then-else rules
     - Approximates the target function by piecewise constant regions
     - Does not require prior knowledge of the data distribution; works well on noisy data
     - Has been applied to:
       - classifying medical patients by disease,
       - equipment malfunction by cause,
       - loan applicants by likelihood of payment.
  5. Setting
     - Given old data about customers and payments, predict a new applicant's loan eligibility.
     [Diagram: previous customers (Age, Salary, Profession, Location, Customer type) feed a classifier, which produces decision rules such as "Salary > 5L" and "Prof. = Exec"; the rules label a new applicant's data Good/Bad.]
  6. Decision trees
     - A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
     [Example tree: tests on Salary < 1M, Prof = teaching, and Age < 30 lead to Good/Bad leaves.]
  7. Topics to be covered
     - Tree construction:
       - Basic tree learning algorithm
       - Measures of predictive ability
       - High-performance decision tree construction: SPRINT
     - Tree pruning:
       - Why prune
       - Methods of pruning
     - Other issues:
       - Handling missing data
       - Continuous class labels
       - Effect of training size
  8. Tree learning algorithms
     - ID3 (Quinlan 1986)
     - Successor C4.5 (Quinlan 1993)
     - SLIQ (Mehta et al.)
     - SPRINT (Shafer et al.)
  9. Basic algorithm for tree building
     Greedy top-down construction:

     Gen_Tree(node, data)
         should node be made a leaf?  (selection criteria)
             yes: stop
         find the best attribute and the best split on that attribute
         partition data on the split condition
         for each child j of node:
             Gen_Tree(node_j, data_j)
  10. Split criteria
     - Select the attribute that is best for classification.
     - Intuitively, pick the one that best separates instances of different classes.
     - Quantifying the intuition means measuring separability:
     - First define the impurity of an arbitrary set S consisting of K classes.
  11. Impurity measures
     - Information entropy: E(S) = -sum_k p_k log p_k
       - Zero when S consists of only one class; maximal (1 for two classes, with log base 2) when all classes occur in equal numbers
     - Other measures of impurity, e.g. the Gini index: Gini(S) = 1 - sum_k p_k^2
  12. Split criteria
     - K classes; a set S of instances is partitioned into r subsets S1, ..., Sr. Subset Sj has a fraction pij of instances of class i.
     - Information entropy of subset Sj: E(Sj) = -sum_i pij log pij
     - Gini index of subset Sj: Gini(Sj) = 1 - sum_i pij^2
     [Figure: impurity as a function of the class fraction for r = 1, k = 2; both curves peak at the 50/50 split.]
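As a concrete sketch, the two impurity measures can be written directly from the definitions (Python, using log base 2 so that a 50/50 two-class split has entropy 1):

```python
import math

def entropy(p):
    """Information entropy of a class distribution p (list of class fractions)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity of a class distribution p."""
    return 1 - sum(pi * pi for pi in p)
```

Both are zero for a pure set: `entropy([1.0])` and `gini([1.0])` return 0, while `entropy([0.5, 0.5])` is 1 and `gini([0.5, 0.5])` is 0.5.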
  13. Information gain
     - Information gain on partitioning S into r subsets:
     - Gain = Impurity(S) - sum of the weighted impurities of the subsets
  14. Information gain: example
     - K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
       E(S) = -0.6 log 0.6 - 0.4 log 0.4 = 0.29
     - Split S into S1 and S2:
       |S1| = 70, p1 = 0.8, p2 = 0.2: E(S1) = -0.8 log 0.8 - 0.2 log 0.2 = 0.22
       |S2| = 30, p1 = 0.13, p2 = 0.87: E(S2) = -0.13 log 0.13 - 0.87 log 0.87 = 0.17
     - Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.09
     (Logarithms are base 10.)
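The slide's arithmetic can be checked directly; that the figures use base-10 logarithms is an assumption inferred from the numbers (with base 2 the same distribution gives E(S) ≈ 0.97):

```python
import math

def entropy10(p):
    """Entropy with base-10 logs, matching the slide's figures."""
    return -sum(pi * math.log10(pi) for pi in p if pi > 0)

e_s  = entropy10([0.6, 0.4])            # ≈ 0.29
e_s1 = entropy10([0.8, 0.2])            # ≈ 0.22
e_s2 = entropy10([0.13, 0.87])          # ≈ 0.17
gain = e_s - (0.7 * e_s1 + 0.3 * e_s2)  # ≈ 0.09
```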
  15. Feature subset selection
     - Embedded: selection happens inside the mining algorithm
     - Filter: select features in advance, independently of the mining algorithm
     - Wrapper: generate candidate feature sets and test them on the black-box mining algorithm
       - high cost, but yields better algorithm-dependent features
  16. Meta-learning methods
     - No single classifier is good in all cases
     - Difficult to evaluate the conditions in advance
     - Meta-learning: combine the outputs of several classifiers
       - Voting: sum up the votes of the component classifiers
       - Combiners: learn a new classifier on the outcomes of the previous ones
       - Boosting: staged classifiers
     - Disadvantage: interpretation is hard
       - Knowledge probing: learn a single classifier that mimics the meta-classifier
  17. SPRINT (Scalable PaRallelizable INduction of decision Trees)
     - Decision-tree classifier for data mining
     - Design goals:
       - Able to handle large disk-resident training sets
       - No restrictions on training-set size
       - Easily parallelizable
  18. Example
     [Example data and the resulting tree: a root test Age < 25 and a child test CarType in {sports}, with leaves labeled High, High, and Low.]
  19. Building the tree
     GrowTree(TrainingData D)
         Partition(D);

     Partition(Data D)
         if all points in D belong to the same class then
             return;
         for each attribute A do
             evaluate splits on attribute A;
         use the best split found to partition D into D1 and D2;
         Partition(D1);
         Partition(D2);
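A toy Python version of this recursion, using the Gini index as the split criterion. The representation of rows, labels, and the nested-dict tree is an illustrative choice, not part of SPRINT itself:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (attribute, threshold) pair minimizing the weighted
    Gini impurity, or None when no split improves on the parent."""
    best_score, best = gini(labels), None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] < t]
            right = [l for r, l in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_score, best = score, (f, t)
    return best

def grow_tree(rows, labels):
    """Greedy top-down construction; leaves are majority-class labels."""
    split = best_split(rows, labels)
    if split is None:                       # make the node a leaf
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    lt = [(r, l) for r, l in zip(rows, labels) if r[f] < t]
    ge = [(r, l) for r, l in zip(rows, labels) if r[f] >= t]
    return {"split": (f, t),
            "left": grow_tree([r for r, _ in lt], [l for _, l in lt]),
            "right": grow_tree([r for r, _ in ge], [l for _, l in ge])}
```

On a one-attribute toy set such as `grow_tree([[1], [2], [3], [4]], ['a', 'a', 'b', 'b'])`, the root splits at value 3 and both children become pure leaves.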
  20. Data setup: attribute lists
     - One list for each attribute
     - Entries in an attribute list consist of:
       - attribute value
       - class value
       - record id
     - Lists for continuous attributes are kept in sorted order
     - Lists may be disk-resident
     - Each leaf node has its own set of attribute lists representing the training examples belonging to that leaf
  21. Attribute lists: example
     [Figure: the initial attribute lists for the root node.]
  22. Evaluating split points
     - Gini index: if data D contains examples from c classes,
       Gini(D) = 1 - sum_j pj^2, where pj is the relative frequency of class j in D
     - If D is split into D1 and D2 with n1 and n2 tuples respectively,
       Gini_split(D) = (n1/n) Gini(D1) + (n2/n) Gini(D2)
     - Note: only class frequencies are needed to compute the index
  23. Finding split points
     - For each attribute A do
       - evaluate splits on attribute A using its attribute list
     - Keep the split with the lowest Gini index
  24. Finding split points: continuous attributes
     - Consider splits of the form: value(A) < x
       - Example: Age < 17
     - Evaluate this split form for every value in the attribute list
     - To evaluate splits on attribute A for a given tree node:

       initialize the class histogram of the left child to zeroes;
       initialize the class histogram of the right child to its parent's histogram;
       for each record in the attribute list do
           evaluate the splitting index for value(A) < record.value;
           using the class label of the record, update the class histograms;
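The incremental scan above can be sketched as follows. It assumes class labels are small integers and the attribute list is already sorted, as SPRINT requires; only the two histograms change per step:

```python
def gini_hist(hist):
    """Gini impurity computed from a class histogram."""
    n = sum(hist)
    return 1 - sum((c / n) ** 2 for c in hist) if n else 0.0

def continuous_splits(attr_list, n_classes):
    """Scan a sorted attribute list of (value, class) pairs, yielding
    (threshold x, gini_split) for each candidate split value(A) < x."""
    n = len(attr_list)
    right = [0] * n_classes                 # right child starts as the parent
    for _, c in attr_list:
        right[c] += 1
    left = [0] * n_classes                  # left child starts empty
    prev = None
    for value, c in attr_list:
        if prev is not None and value != prev:
            nl, nr = sum(left), sum(right)
            yield value, (nl * gini_hist(left) + nr * gini_hist(right)) / n
        left[c] += 1                        # move the record to the left side
        right[c] -= 1
        prev = value
```

For the list `[(17, 0), (20, 0), (23, 1), (32, 1)]` the best candidate is `Age < 23`, which separates the classes perfectly (Gini 0).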
  25. Finding split points: continuous attributes (example)
     [Figure: a cursor scans the sorted Age attribute list; cursor positions 0, 1, 3, and 6 correspond to the candidate splits Age < 17, Age < 20, Age < 32, and the end of the list, with the left/right class histograms shown at each position and GINI values of undef, 0.4, 0.222, and undef at successive positions.]
  26. Finding split points: categorical attributes
     - Consider splits of the form: value(A) in {x1, x2, ..., xn}
       - Example: CarType in {family, sports}
     - Evaluate this split form for subsets of domain(A)
     - To evaluate splits on attribute A for a given tree node:

       initialize the class/value matrix of the node to zeroes;
       for each record in the attribute list do
           increment the appropriate count in the matrix;
       evaluate the splitting index for various subsets using the constructed matrix;
  27. Finding split points: categorical attributes (example)
     [Figure: the class/value matrix built from the CarType attribute list, with GINI indexes for the candidate subsets {family}, {truck}, and {sports} of 0.444, 0.267, and 0.333.]
  28. Performing the splits
     - The attribute lists of every node must be divided among the two children
     - To split the attribute lists of a given node:

       for the list of the attribute used to split this node do
           use the split test to divide the records;
           collect the record ids;
       build a hash table from the collected ids;
       for the remaining attribute lists do
           use the hash table to divide each list;
       build class histograms for each new leaf;
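A small in-memory sketch of this two-phase split (a plain dict stands in for the hash table; the `attr_lists` layout and function names are illustrative):

```python
def perform_split(attr_lists, split_attr, test):
    """SPRINT-style split of a node's attribute lists. attr_lists maps an
    attribute name to a list of (value, class, record_id) entries; `test`
    is the split predicate on the splitting attribute's values."""
    # Phase 1: divide the splitting attribute's list and collect record ids.
    side = {}                               # record id -> "left" or "right"
    for value, _, rid in attr_lists[split_attr]:
        side[rid] = "left" if test(value) else "right"
    # Phase 2: probe the hash table to divide every remaining list.
    left, right = {}, {}
    for attr, entries in attr_lists.items():
        left[attr] = [e for e in entries if side[e[2]] == "left"]
        right[attr] = [e for e in entries if side[e[2]] == "right"]
    return left, right
```

Because each list is filtered in its original order, the sorted order of continuous attribute lists is preserved in both children without re-sorting.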
  29. Performing the splits: example
     Split test: Age < 32
     Hash table: record id 0 -> Left, 1 -> Left, 2 -> Right, 3 -> Right, 4 -> Right, 5 -> Left
  30. SPRINT: summary
     - Each node of the decision tree requires examining possible splits on each value of each attribute.
     - After choosing a split attribute, all the data must be partitioned into the resulting subsets.
     - This search needs to be made efficient.
     - Evaluating splits on numeric attributes:
       - sort on attribute value, incrementally evaluate Gini
     - Splits on categorical attributes:
       - for each subset, find the Gini index and choose the best
       - for large sets, use a greedy method
  31. Preventing overfitting
     - A tree T overfits if there is another tree T' that gives higher error on the training data yet gives lower error on unseen data.
     - An overfitted tree does not generalize to unseen instances.
     - Happens when the data contains noise or irrelevant attributes and the training set is small.
     - Overfitting can reduce accuracy drastically:
       - 10-25%, as reported in Mingers (Machine Learning, 1989)
     - Occam's razor: prefer the simplest hypothesis that fits the data
  32. Approaches to prevent overfitting
     - Stop growing the tree beyond a certain point
     - First overfit, then post-prune (more widely used)
       - Tree building divided into phases:
         - Growth phase
         - Prune phase
     - It is hard to decide when to stop growing the tree, so the second approach is more widely used.
  33. Criteria for finding the correct final tree size
     - Cross-validation with separate test data
     - Use all data for training, but apply a statistical test to decide the right size
     - Use a criterion function to choose the best size
       - Example: minimum description length (MDL) criterion
     - Cross-validation approach:
       - Partition the dataset into two disjoint parts:
         1. Training set, used for building the tree.
         2. Validation set, used for pruning the tree.
       - Build the tree using the training set.
       - Evaluate the tree on the validation set, keeping a count of correctly labeled data at each leaf and internal node.
       - Starting bottom-up, prune a node if its own error is less than the combined error of its children.
  34. Cross-validation (continued)
     - Need a large validation set to smooth out the overfitting of the training data. Rule of thumb: one third.
     - What if the training data set is limited in size?
       - Generate many different partitions of the data.
       - n-fold cross-validation: partition the training data into n parts D1, D2, ..., Dn.
       - Train n classifiers, with D - Di as the training set and Di as the test set.
       - Average the results.
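The n-fold partitioning step can be sketched in a few lines (fold assignment by striding is one simple choice; shuffling first is equally common):

```python
def n_fold_splits(data, n):
    """Yield (train, test) pairs for n-fold cross-validation:
    fold Di is held out in turn and training uses D - Di."""
    folds = [data[i::n] for i in range(n)]  # stride the data into n folds
    for i in range(n):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]
```

Every point appears in exactly one test fold, so each (train, test) pair covers the whole dataset.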
  35. Rule-based pruning
     - Tree-based pruning limits the kinds of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
     - Rule-based: for each leaf of the tree, extract a rule as the conjunction of all tests up to the root.
     - On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
     - Sort the rules by decreasing accuracy.
  36. MDL-based pruning
     - Idea: a branch of the tree is overfitted if the training examples that fall under it can be explicitly enumerated (with their classes) in less space than the tree occupies
     - Prune a branch if it is overfitted
       - philosophy: use the tree that minimizes the description length of the training data
  37. Regression trees
     - Decision trees with continuous class labels.
     - Regression trees approximate the function with piecewise constant regions.
     - Split criteria for regression trees:
       - Predicted value for a set S = average of all values in S
       - Error: sum of squared errors of each member of S from the predicted average
       - Pick the split with the smallest error.
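The regression-tree split criterion above amounts to this small function, applied to each candidate subset:

```python
def regression_error(values):
    """Regression-tree criterion: predict the mean of the set and
    charge the sum of squared deviations from it."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)
```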
  38. Issues
     - Multiple splits on continuous attributes [Fayyad 93, multi-interval discretization of continuous attributes]
     - Multi-attribute tests on nodes to handle correlated attributes
       - multivariate linear splits [oblique trees, Murthy 94]
     - Methods of handling missing values:
       - assume the majority value
       - take the most probable path
     - Allowing varying costs for different attributes
  39. Pros and cons of decision trees
     - Cons:
       - cannot handle complicated relationships between features
       - simple decision boundaries
       - problems with lots of missing data
     - Pros:
       - reasonable training time
       - fast application
       - easy to interpret
       - easy to implement
       - can handle a large number of features
  40. Clustering, or Unsupervised Learning
  41. Distance functions
     - Numeric data: Euclidean, Manhattan distances
       - Minkowski metric: (sum_i |xi - yi|^m)^(1/m)
       - Larger m gives higher weight to larger distances
     - Categorical data: 0/1 to indicate presence/absence
       - Euclidean distance: equal weight to 1-1 and 0-0 matches
       - Hamming distance (number of dissimilarities)
       - Jaccard coefficient: number of matching 1s / number of positions with a 1 (0-0 matches not important)
     - Combined numeric and categorical data: weighted normalized distance
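The Minkowski metric and the Jaccard coefficient from this slide, written out (m = 1 recovers Manhattan and m = 2 Euclidean distance):

```python
def minkowski(x, y, m):
    """Minkowski metric: m = 1 gives Manhattan, m = 2 gives Euclidean."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1 / m)

def jaccard(x, y):
    """Jaccard coefficient on 0/1 vectors: matching 1s over positions
    where either vector has a 1 (0-0 matches are ignored)."""
    both = sum(1 for a, b in zip(x, y) if a == b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 1.0
```

For example, `minkowski([0, 0], [3, 4], 2)` is the familiar Euclidean distance 5.0.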
  42. Distance functions on high-dimensional data
     - Examples: time series, text, images
     - Euclidean measures tend to make all points equally far apart
     - Reduce the number of dimensions:
       - choose a subset of the original features using random projections or feature selection techniques
       - transform the original features using statistical methods like Principal Component Analysis
     - Define domain-specific similarity measures: e.g. for images, define features like the number of objects or a color histogram; for time series, define shape-based measures.
     - Define non-distance-based (model-based) clustering methods
  43. Clustering methods
     - Hierarchical clustering
       - agglomerative vs. divisive
       - single link vs. complete link
     - Partitional clustering
       - distance-based: K-means
       - model-based: EM
       - density-based
  44. Agglomerative hierarchical clustering
     - Given: a matrix of similarities between every pair of points
     - Start with each point in a separate cluster and merge clusters based on some criterion:
       - Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is least
       - Complete link: merge two clusters such that all points in one cluster are "close" to all points in the other
  45. Pros and cons
     - Single link:
       - confused by near overlap
       - chaining effect
     - Complete link:
       - unnecessary splits of elongated point clouds
       - sensitive to outliers
     - Several other hierarchical methods are known.
  46. Partitional methods: K-means
     - Criterion: minimize the sum of squared distances
       - between each point and the centroid of its cluster, or
       - between each pair of points in the cluster
     - Algorithm:
       - Select an initial partition with K clusters: random, first K, or K well-separated points
       - Repeat until stabilization:
         - assign each point to the closest cluster center
         - generate new cluster centers
         - adjust clusters by merging/splitting
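The assign/re-center loop above, sketched for points given as tuples of floats (random initial centers; the merge/split adjustment step is omitted for brevity):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic K-means: alternate assignment and center re-estimation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)         # random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                    # assign p to the closest center
            dist = lambda c: sum((a - b) ** 2 for a, b in zip(p, c))
            clusters[min(range(k), key=lambda i: dist(centers[i]))].append(p)
        # recompute each center as the mean of its cluster
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters
```

On two well-separated 1-D groups such as `[(0.0,), (1.0,), (10.0,), (11.0,)]` with k = 2, the centers settle at 0.5 and 10.5.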
  47. Properties
     - May not reach the global optimum
     - Converges fast in practice; guaranteed for certain forms of the optimization function
     - Complexity: O(KndI), where
       - I is the number of iterations, n the number of points, d the number of dimensions, K the number of clusters
     - Database research on scalable algorithms:
       - BIRCH: one/two passes of the data, keeping an R-tree-like index in memory [SIGMOD 96]
  48. Model-based clustering
     - Assume the data is generated from K probability distributions
     - Typically Gaussian distributions; a soft or probabilistic version of K-means clustering
     - Need to find the distribution parameters
     - EM algorithm
  49. EM algorithm
     - Initialize K cluster centers
     - Iterate between two steps:
       - Expectation step: assign points to clusters (probabilistically)
       - Maximization step: estimate model parameters
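A minimal sketch of the E and M steps for a 1-D mixture of two Gaussians (the min/max initialization and fixed iteration count are simplifying assumptions, not part of the general algorithm):

```python
import math

def em_two_gaussians(xs, iters=50):
    """EM for a 1-D mixture of two Gaussians: a soft K-means."""
    n = len(xs)
    mu = [min(xs), max(xs)]                 # crude initialization
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability (responsibility) of each cluster
        resp = []
        for x in xs:
            dens = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                    / math.sqrt(2 * math.pi * var[j]) for j in (0, 1)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate mixture weights, means, and variances
        for j in (0, 1):
            nj = sum(r[j] for r in resp)
            w[j] = nj / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, xs)) / nj, 1e-6)
    return w, mu, var
```

On data drawn from two well-separated groups the means converge to the group centers, with each mixture weight near its group's share of the points.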
  50. Properties
     - May not reach the global optimum
     - Converges fast in practice; guaranteed for certain forms of the optimization function
     - Complexity: O(KndI), where
       - I is the number of iterations, n the number of points, d the number of dimensions, K the number of clusters
  51. Scalable clustering algorithms
     - BIRCH: one/two passes of the data, keeping an R-tree-like index in memory [SIGMOD 96]
     - Fayyad and Bradley: sample repeatedly and update a summary of clusters stored in memory (K-means and EM) [KDD 98]
     - Dasgupta 99: a recent theoretical breakthrough; finds Gaussian clusters with guaranteed performance
       - random projections
  52. Data mining challenges
     - Large databases
       - sampling, parallel processing, approximation
     - High dimensionality
       - feature selection, transformation
     - Overfitting
     - Data evolution
     - Missing/noisy data
     - Non-obvious patterns
     - User interaction / prior knowledge
     - Integration with other systems
  53. To Learn More
  54. Books
     - Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999
     - Usama Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
     - Tom Mitchell, Machine Learning, McGraw-Hill
  55. Software
     - Public domain:
       - Weka 3: data mining algorithms in Java
         - classification, regression
       - MLC++: data mining tools in C++
         - mainly classification
     - Free for universities:
       - try convincing IBM to give it free!
     - Datasets: follow links to the UC Irvine site
  56. Resources
     - A great site with links to software, datasets, etc.; be sure to visit it.
     - OLAP resources
     - SIGKDD
     - The Data Mining and Knowledge Discovery journal
     - Communications of the ACM, Special Issue on Data Mining, Nov 1996
  57. Resources at IITB
     - IITB DB group home page
     - Data Warehousing and Data Mining course offered by Prof. Sunita Sarawagi at IITB