# datamining-algos-IEP.ppt

• Speaker note: this talk covers a number of topics on the subject. Goal: commentary, covering system integration and architectural aspects, shortcomings of existing systems, and future directions. Not a teaching lecture; details of specific algorithms are covered on demand and can be addressed later.
• Insert graph of entropy?
• Speaker note: two major problems: how to find the split points that define node tests, and how to partition the data once the split points are found. CART and C4.5 use depth-first search with repeated sorting at each node (for continuous attributes); SLIQ uses a sort-once technique based on attribute lists.

1. Data Mining Algorithms. Prof. S. Sudarshan, CSE Dept, IIT Bombay. Most slides courtesy Prof. Sunita Sarawagi, School of IT, IIT Bombay.

2. Overview
    - Decision tree classification algorithms
    - Clustering algorithms
    - Challenges
    - Resources

3. Decision Tree Classifiers

4. Decision tree classifiers
    - Widely used learning method.
    - Easy to interpret: can be re-represented as if-then-else rules.
    - Approximates functions by piecewise-constant regions.
    - Requires no prior knowledge of the data distribution; works well on noisy data.
    - Has been applied to classify medical patients by disease, equipment malfunctions by cause, and loan applicants by likelihood of payment.

5. Setting
    - Given old data about customers and their payments, predict a new applicant's loan eligibility.
    - (Figure: previous customers, described by Age, Salary, Profession, Location and Customer type, train a classifier; the resulting decision rules, e.g. Salary > 5 L and Prof. = Exec, are applied to the new applicant's data to predict Good/Bad.)

6. Decision trees
    - A tree whose internal nodes are simple decision rules on one or more attributes, and whose leaf nodes are predicted class labels.
    - (Figure: example tree with tests Salary < 1 M, Prof = teaching, Age < 30 and leaves labeled Good/Bad.)

7. Topics to be covered
    - Tree construction:
      - basic tree learning algorithm
      - measures of predictive ability
      - high-performance decision tree construction: SPRINT
    - Tree pruning:
      - why prune
      - methods of pruning
    - Other issues:
      - handling missing data
      - continuous class labels
      - effect of training size

8. Tree learning algorithms
    - ID3 (Quinlan 1986) and its successor C4.5 (Quinlan 1993)
    - SLIQ (Mehta et al.)
    - SPRINT (Shafer et al.)

9. Basic algorithm for tree building
    - Greedy top-down construction:

            Gen_Tree(node, data)
                if node should be a leaf: stop
                find the best attribute and the best split on it   // selection criteria
                partition the data on the split condition
                for each child j of node:
                    Gen_Tree(node_j, data_j)

10. Split criteria
    - Select the attribute that is best for classification: intuitively, the one that best separates instances of different classes.
    - To quantify this intuition we need a measure of separability: first define the impurity of an arbitrary set S consisting of K classes.

11. Impurity measures
    - Information entropy: E(S) = -Σi pi log pi. Zero when S consists of only one class; maximal when all classes occur in equal numbers.
    - Another measure of impurity, the Gini index: Gini(S) = 1 - Σi pi².

12. Split criteria
    - K classes; a set S of instances is partitioned into r subsets S1..Sr, where subset Si contains fraction pij of instances of class j.
    - Information entropy: E(Si) = -Σj pij log pij
    - Gini index: Gini(Si) = 1 - Σj pij²
    - (Figure: impurity as a function of the class fraction for r = 1, k = 2, comparing the entropy and Gini curves.)
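These impurity measures can be sketched in a few lines of Python (a minimal sketch; base-2 logarithms are used for the entropy, so a 50/50 two-class set has entropy exactly 1 as stated above):

```python
import math

def entropy(p):
    """Information entropy of a class distribution p (fractions summing to 1).
    Zero when the set consists of only one class; 1 for two equally
    frequent classes (base-2 logarithms)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini index: 1 minus the sum of squared class fractions."""
    return 1.0 - sum(pi * pi for pi in p)

# A pure set has zero impurity; a 50/50 two-class set is maximally impure.
assert entropy([1.0]) == 0.0 and gini([1.0]) == 0.0
assert entropy([0.5, 0.5]) == 1.0 and gini([0.5, 0.5]) == 0.5
```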
13. Information gain
    - Information gain on partitioning S into r subsets:
      Gain(S; S1..Sr) = Impurity(S) - Σi (|Si|/|S|) Impurity(Si),
      i.e. the impurity of S minus the weighted impurity of each subset.

14. Information gain: example
    - K = 2, |S| = 100, p1 = 0.6, p2 = 0.4 (the logarithms here are base 10):
      E(S) = -0.6 log 0.6 - 0.4 log 0.4 = 0.29
    - Partition into S1, S2:
      - |S1| = 70, p1 = 0.8, p2 = 0.2: E(S1) = -0.8 log 0.8 - 0.2 log 0.2 = 0.21
      - |S2| = 30, p1 = 0.13, p2 = 0.87: E(S2) = -0.13 log 0.13 - 0.87 log 0.87 = 0.16
    - Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) ≈ 0.1
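The arithmetic on this slide checks out if the logarithms are taken to base 10; a quick verification (the names here are just for illustration):

```python
import math

def entropy10(p):
    # The slide's numbers come out right with base-10 logarithms.
    return -sum(pi * math.log10(pi) for pi in p if pi > 0)

e_s  = entropy10([0.6, 0.4])    # ~0.292, the slide's E(S) = 0.29
e_s1 = entropy10([0.8, 0.2])    # ~0.217
e_s2 = entropy10([0.13, 0.87])  # ~0.168
gain = e_s - (0.7 * e_s1 + 0.3 * e_s2)
assert abs(gain - 0.09) < 0.001  # the slide rounds this to 0.1
```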
15. Feature subset selection
    - Embedded: done inside the mining algorithm.
    - Filter: select features in advance.
    - Wrapper: generate candidate feature sets and test them on the black-box mining algorithm; high cost, but yields better, algorithm-dependent features.

16. Meta-learning methods
    - No single classifier is good in all cases, and it is difficult to evaluate the conditions in advance.
    - Meta-learning combines the effects of several classifiers:
      - Voting: sum up the votes of the component classifiers.
      - Combiners: learn a new classifier on the outcomes of the previous ones.
      - Boosting: staged classifiers.
    - Disadvantage: interpretation is hard.
      - Knowledge probing: learn a single classifier to mimic the meta-classifier.

17. SPRINT (Serial PaRallelizable INduction of decision Trees)
    - Decision-tree classifier for data mining.
    - Design goals: handle large disk-resident training sets, with no restrictions on training-set size, and be easily parallelizable.

18. Example
    - (Figure: example data and the resulting tree with tests Age < 25 and CarType in {sports}, and leaves High, High and Low.)

19. Building the tree

        GrowTree(TrainingData D)
            Partition(D);

        Partition(Data D)
            if (all points in D belong to the same class) then
                return;
            for each attribute A do
                evaluate splits on attribute A;
            use the best split found to partition D into D1 and D2;
            Partition(D1);
            Partition(D2);

20. Data setup: attribute lists
    - One list per attribute; each entry holds the attribute value, the class value and the record id (rid).
    - Lists for continuous attributes are kept in sorted order.
    - Lists may be disk-resident.
    - Each leaf node has its own set of attribute lists representing the training examples belonging to that leaf.

21. Attribute lists: example
    - (Figure: the initial attribute lists for the root node.)

22. Evaluating split points
    - Gini index: if data D contains examples from c classes,
      Gini(D) = 1 - Σj pj², where pj is the relative frequency of class j in D.
    - If D is split into D1 and D2 with n1 and n2 tuples respectively,
      Gini_split(D) = (n1/n) Gini(D1) + (n2/n) Gini(D2).
    - Note: only class frequencies are needed to compute the index.
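A minimal Python sketch of these two formulas, working directly from class frequencies as the slide notes:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_j pj^2, where pj is the relative frequency of
    class j in D. Only class frequencies are needed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """Gini_split(D) = (n1/n)*Gini(D1) + (n2/n)*Gini(D2)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# A split that separates the classes perfectly has index 0;
# one that separates nothing keeps the parent's impurity.
assert gini_split(["High", "High"], ["Low", "Low"]) == 0.0
assert gini_split(["High", "Low"], ["High", "Low"]) == 0.5
```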
23. Finding split points
    - For each attribute A, evaluate splits on A using its attribute list.
    - Keep the split with the lowest Gini index.

24. Finding split points: continuous attributes
    - Consider splits of the form value(A) < x, e.g. Age < 17.
    - Evaluate this split form for every value in the attribute list.
    - To evaluate splits on attribute A for a given tree node:

            initialize the class histogram of the left child to zeroes;
            initialize the class histogram of the right child to that of its parent;
            for each record in the attribute list do
                evaluate the splitting index for value(A) < record.value;
                using the class label of the record, update the class histograms;
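The scan above can be sketched as follows; this is my own illustrative code, not SPRINT's, with the attribute list assumed pre-sorted and the ages chosen so that the Gini values reproduce those on the example slide (Age < 17 undefined, Age < 20 gives 0.4, Age < 32 gives 0.222):

```python
from collections import Counter

def gini(hist):
    """Gini index of a class histogram; undefined (None) for an empty child."""
    n = sum(hist.values())
    if n == 0:
        return None
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def best_continuous_split(attr_list):
    """attr_list: (value, class_label) pairs, pre-sorted by value.
    Returns (best Gini index, threshold x) over tests value(A) < x."""
    n = len(attr_list)
    left = Counter()                                   # starts at zeroes
    right = Counter(label for _, label in attr_list)   # starts as the parent
    best_index, best_x = float("inf"), None
    for value, label in attr_list:
        gl, gr = gini(left), gini(right)
        if gl is not None and gr is not None:          # skip undefined splits
            nl = sum(left.values())
            split = nl / n * gl + (n - nl) / n * gr
            if split < best_index:
                best_index, best_x = split, value
        left[label] += 1                               # move the record left
        right[label] -= 1
    return best_index, best_x

ages = [(17, "H"), (20, "H"), (23, "H"), (32, "L"), (43, "H"), (68, "L")]
best_index, best_x = best_continuous_split(ages)
assert best_x == 32 and abs(best_index - 0.222) < 0.001
```

Note how each candidate test is evaluated before its record is moved to the left histogram, so the first test ("Age < 17") leaves the left child empty and its index undefined, exactly as on the example slide.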
25. Finding split points: continuous attributes (example)
    - (Figure: a scan over the sorted Age attribute list, showing the left- and right-child class histograms at each cursor position and the Gini index of each candidate test: Age < 17: undefined; Age < 20: 0.4; Age < 32: 0.222.)

26. Finding split points: categorical attributes
    - Consider splits of the form value(A) ∈ {x1, x2, ..., xn}, e.g. CarType ∈ {family, sports}.
    - Evaluate this split form for subsets of domain(A).
    - To evaluate splits on attribute A for a given tree node:

            initialize the class/value matrix of the node to zeroes;
            for each record in the attribute list do
                increment the appropriate count in the matrix;
            evaluate the splitting index for various subsets using the constructed matrix;

27. Finding split points: categorical attributes (example)
    - (Figure: the attribute list and class/value matrix, with Gini indices 0.444, 0.267 and 0.333 for the candidate tests CarType in {family}, CarType in {truck} and CarType in {sports}.)

28. Performing the splits
    - The attribute lists of every node must be divided among the two children.
    - To split the attribute lists of a given node:

            for the list of the attribute used to split this node do
                use the split test to divide the records;
                collect the record ids;
            build a hash table from the collected ids;
            for the remaining attribute lists do
                use the hash table to divide each list;
            build class histograms for each new leaf;
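A hedged sketch of this step (the data and helper names are illustrative, not SPRINT's, and the final class-histogram step is omitted). A set of record ids plays the role of the hash table built from the winning attribute's list:

```python
def perform_split(attr_lists, split_attr, test):
    """attr_lists: dict mapping attribute name -> list of
    (value, class_label, rid) entries, one list per attribute."""
    # Apply the split test to the splitting attribute's list and
    # collect the rids that go to the left child (the "hash table").
    left_rids = {rid for value, _, rid in attr_lists[split_attr] if test(value)}
    # Probe that table to divide every attribute list.
    left, right = {}, {}
    for name, entries in attr_lists.items():
        left[name] = [e for e in entries if e[2] in left_rids]
        right[name] = [e for e in entries if e[2] not in left_rids]
    return left, right

# Illustrative lists; with the test Age < 32, rids 0, 1 and 5 go left,
# matching the shape of the hash table on the example slide.
lists = {
    "age": [(17, "H", 1), (20, "H", 5), (23, "H", 0),
            (32, "L", 4), (43, "H", 2), (68, "L", 3)],
    "cartype": [("family", "H", 0), ("sports", "H", 1), ("sports", "H", 2),
                ("family", "L", 3), ("truck", "H", 4), ("family", "L", 5)],
}
left, right = perform_split(lists, "age", lambda v: v < 32)
assert {e[2] for e in left["cartype"]} == {0, 1, 5}
```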
29. Performing the splits: example
    - (Figure: split test Age < 32; the resulting hash table of record ids maps 0, 1 and 5 to Left and 2, 3 and 4 to Right.)

30. SPRINT: summary
    - Each node of the decision tree classifier requires examining possible splits on each value of each attribute; after choosing a split attribute, all the data must be partitioned into the resulting subsets. This search needs to be made efficient.
    - Evaluating splits on numeric attributes: sort on attribute value once, then incrementally evaluate the Gini index.
    - Splits on categorical attributes: find the Gini index for each subset and choose the best; for large sets, use a greedy method.

31. Preventing overfitting
    - A tree T overfits if there is another tree T' that gives higher error on the training data yet lower error on unseen data.
    - An overfitted tree does not generalize to unseen instances.
    - Overfitting happens when the data contains noise or irrelevant attributes and the training set is small.
    - Overfitting can reduce accuracy drastically: 10-25%, as reported in Mingers (1989, Machine Learning).
    - Occam's razor: prefer the simplest hypothesis that fits the data.

32. Approaches to prevent overfitting
    - Stop growing the tree beyond a certain point.
    - First overfit, then post-prune: tree building is divided into a growth phase and a prune phase.
    - Since it is hard to decide when to stop growing the tree, the second approach is more widely used.

33. Criteria for finding the correct final tree size
    - Cross-validation with separate test data.
    - Use all data for training, but apply a statistical test to decide the right size.
    - Use some criterion function to choose the best size, e.g. the minimum description length (MDL) criterion.
    - The cross-validation approach:
      - Partition the dataset into two disjoint parts: a training set used for building the tree, and a validation set used for pruning it.
      - Build the tree using the training set.
      - Evaluate the tree on the validation set, keeping a count of correctly labeled data at each leaf and internal node.
      - Starting bottom-up, prune nodes whose error is less than that of their children.

34. Cross-validation, continued
    - A large validation set is needed to smooth out the over-fitting of the training data; rule of thumb: one third.
    - What if the training data set is of limited size? Generate many different partitions of the data:
      - n-fold cross-validation: partition the training data into n parts D1, D2, ..., Dn;
      - train n classifiers, using D - Di for training and Di as the test set;
      - average the results.
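A small sketch of n-fold cross-validation as described above; `train_fn` stands for any black-box learner that returns a classifier, and the majority-label "learner" is only there to make the example runnable:

```python
def cross_validate(points, labels, train_fn, n=3):
    """Partition the data into n parts D1..Dn, train on D - Di,
    test on Di, and return the average accuracy over the n folds."""
    folds = [list(range(i, len(points), n)) for i in range(n)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [(points[i], labels[i])
                 for i in range(len(points)) if i not in held_out]
        model = train_fn(train)                 # black-box learner
        correct = sum(model(points[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

# A degenerate "learner" that always predicts the majority training label:
def majority(train):
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

acc = cross_validate(list(range(9)), ["G"] * 6 + ["B"] * 3, majority)
```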
35. Rule-based pruning
    - Tree-based pruning limits the kinds of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
    - Rule-based: for each leaf of the tree, extract a rule as the conjunction of all tests up to the root.
    - On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
    - Sort the rules by decreasing accuracy.

36. MDL-based pruning
    - Idea: a branch of the tree is overfitted if the training examples that fall under it can be explicitly enumerated (with their classes) in less space than the branch itself occupies.
    - Prune a branch if it is overfitted.
    - Philosophy: use the tree that minimizes the description length of the training data.

37. Regression trees
    - Decision trees with continuous class labels.
    - Regression trees approximate the function with piecewise-constant regions.
    - Split criteria for regression trees:
      - the predicted value for a set S is the average of all values in S;
      - the error is the sum of squared deviations of each member of S from the predicted average;
      - pick the split with the smallest average error.

38. Issues
    - Multiple splits on continuous attributes [Fayyad 93, multi-interval discretization of continuous attributes].
    - Multi-attribute tests at nodes, to handle correlated attributes: multivariate linear splits [oblique trees, Murthy 94].
    - Methods of handling missing values: assume the majority value, or take the most probable path.
    - Allowing varying costs for different attributes.

39. Pros and cons of decision trees
    - Cons: cannot handle complicated relationships between features; simple decision boundaries; problems with lots of missing data.
    - Pros: reasonable training time; fast application; easy to interpret; easy to implement; can handle a large number of features.
    - More information: http://www.recursive-partitioning.com/

40. Clustering, or unsupervised learning

41. Distance functions
    - Numeric data: Euclidean and Manhattan distances.
      - Minkowski metric: [Σ |xi - yi|^m]^(1/m); larger m gives higher weight to larger distances.
    - Categorical data: 0/1 to indicate presence/absence.
      - Euclidean distance gives equal weight to 1-matches and 0-matches.
      - Hamming distance: the number of dissimilarities.
      - Jaccard coefficient: matching 1s divided by the number of 1s (0-0 matches are not important).
    - Combined numeric and categorical data: a weighted normalized distance.
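Sketches of these distance functions in Python (the function names are mine):

```python
def minkowski(x, y, m):
    """[sum |xi - yi|^m]^(1/m); m = 1 is Manhattan, m = 2 is Euclidean.
    Larger m gives higher weight to the larger per-coordinate distances."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1 / m)

def hamming(x, y):
    """Number of positions where two 0/1 vectors disagree."""
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    """Matching 1s divided by positions holding a 1 in either vector;
    0-0 matches are deliberately ignored."""
    ones_in_both = sum(a and b for a, b in zip(x, y))
    ones_in_either = sum(a or b for a, b in zip(x, y))
    return ones_in_both / ones_in_either if ones_in_either else 1.0

assert minkowski([0, 0], [3, 4], 2) == 5.0   # Euclidean
assert minkowski([0, 0], [3, 4], 1) == 7.0   # Manhattan
assert hamming([1, 0, 1, 1], [1, 1, 1, 0]) == 2
assert jaccard([1, 1, 0, 0], [1, 0, 1, 0]) == 1 / 3
```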
42. Distance functions on high-dimensional data
    - Examples: time series, text, images.
    - Euclidean measures make all points look equally far apart.
    - Reduce the number of dimensions:
      - choose a subset of the original features, using random projections or feature-selection techniques;
      - transform the original features using statistical methods such as Principal Component Analysis.
    - Define domain-specific similarity measures: e.g. for images, features such as the number of objects or a color histogram; for time series, shape-based measures.
    - Define non-distance-based (model-based) clustering methods.

43. Clustering methods
    - Hierarchical clustering:
      - agglomerative vs. divisive
      - single link vs. complete link
    - Partitional clustering:
      - distance-based: K-means
      - model-based: EM
      - density-based

44. Agglomerative hierarchical clustering
    - Given: a matrix of similarities between every pair of points.
    - Start with each point in a separate cluster and merge clusters based on some criterion:
      - Single link: merge the two clusters for which the minimum distance between a point in one cluster and a point in the other is least.
      - Complete link: merge two clusters such that all points in one cluster are "close" to all points in the other.

45. Pros and cons
    - Single link: confused by nearly overlapping clusters; chaining effect.
    - Complete link: unnecessary splits of elongated point clouds; sensitive to outliers.
    - Several other hierarchical methods are known.

46. Partitional methods: K-means
    - Criterion: minimize the sum of squared distances between each point and the centroid of its cluster (or between each pair of points in the cluster).
    - Algorithm:
      - select an initial partition with K clusters (random, the first K points, or K well-separated points);
      - repeat until stabilization:
        - assign each point to the closest cluster center;
        - generate new cluster centers;
        - adjust clusters by merging/splitting.
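A minimal 1-D K-means sketch following the loop above (the merge/split adjustment step is omitted, and the initialization and data are illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Assign each point to the closest center, recompute centroids,
    and repeat until the centers stabilize (or iters runs out)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:             # stabilized
            break
        centers = new_centers
    return sorted(centers)

points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
centers = kmeans(points, 2)
assert abs(centers[0] - 1.0) < 0.1 and abs(centers[1] - 10.0) < 0.1
```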
47. Properties
    - May not reach the global optimum.
    - Converges fast in practice; convergence is guaranteed for certain forms of the optimization function.
    - Complexity: O(KndI), where I is the number of iterations, n the number of points, d the number of dimensions, and K the number of clusters.
    - Database research on scalable algorithms:
      - BIRCH: one or two passes over the data, keeping an R-tree-like index in memory [SIGMOD 96].

48. Model-based clustering
    - Assume the data is generated from K probability distributions, typically Gaussians.
    - A soft, probabilistic version of K-means clustering.
    - Need to find the distribution parameters: the EM algorithm.

49. EM algorithm
    - Initialize K cluster centers.
    - Iterate between two steps:
      - Expectation step: assign points to clusters (probabilistically);
      - Maximization step: estimate the model parameters.
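The two steps can be sketched for the simplest case: a two-component 1-D Gaussian mixture with fixed, equal variances and equal priors, which behaves as a soft version of 2-means. All names and the fixed-variance simplification are mine, not from the slides:

```python
import math

def em_two_gaussians(points, mu0, mu1, sigma=1.0, iters=50):
    """EM for two 1-D Gaussians with known, equal variance and equal priors:
    only the two means are re-estimated."""
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point.
        resp = []
        for x in points:
            w0 = math.exp(-(x - mu0) ** 2 / (2 * sigma ** 2))
            w1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            resp.append(w0 / (w0 + w1))
        # M-step: re-estimate the means as responsibility-weighted averages.
        n0 = sum(resp)
        n1 = len(points) - n0
        mu0 = sum(r * x for r, x in zip(resp, points)) / n0
        mu1 = sum((1 - r) * x for r, x in zip(resp, points)) / n1
    return sorted([mu0, mu1])

mus = em_two_gaussians([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 0.0, 5.0)
assert abs(mus[0] - 1.0) < 0.05 and abs(mus[1] - 10.0) < 0.05
```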
50. Properties
    - May not reach the global optimum.
    - Converges fast in practice; convergence is guaranteed for certain forms of the optimization function.
    - Complexity: O(KndI), where I is the number of iterations, n the number of points, d the number of dimensions, and K the number of clusters.

51. Scalable clustering algorithms
    - BIRCH: one or two passes over the data, keeping an R-tree-like index in memory [SIGMOD 96].
    - Fayyad and Bradley: sample repeatedly and update a summary of the clusters kept in memory (K-means and EM) [KDD 98].
    - Dasgupta 99: a theoretical breakthrough; finds Gaussian clusters with guaranteed performance, using random projections.

52. Data mining challenges
    - Large databases: sampling, parallel processing, approximation.
    - High dimensionality: feature selection, transformation.
    - Overfitting.
    - Data evolution.
    - Missing/noisy data.
    - Non-obvious patterns.
    - User interaction / prior knowledge.
    - Integration with other systems.