Notes for slides:
  • Cover a number of topics on this subject. Goal: commentary covering system integration, architectural aspects, shortcomings of existing systems, and future directions. Not a teaching lecture; details of specific algorithms are covered on demand. Trailer: details can be addressed later.
  • Insert graph of entropy?
  • Two major problems:
    o How to find split points that define node tests
    o How to partition the data having found split points
    o CART, C4.5: DFS; repeated sorting at each node (continuous attributes)
    o SLIQ: SortOnce technique using attribute lists
  • Transcript

    • 1. Data Mining Algorithms. Prof. S. Sudarshan, CSE Dept, IIT Bombay. Most slides courtesy Prof. Sunita Sarawagi, School of IT, IIT Bombay.
    • 2. Overview
      • Decision Tree classification algorithms
      • Clustering algorithms
      • Challenges
      • Resources
    • 3. Decision Tree Classifiers
    • 4. Decision tree classifiers
      • Widely used learning method
      • Easy to interpret: can be re-represented as if-then-else rules
      • Approximates the target function by piecewise-constant regions
      • Does not require prior knowledge of the data distribution and works well on noisy data
      • Has been applied to:
        • classifying medical patients by disease,
        • classifying equipment malfunctions by cause,
        • classifying loan applicants by likelihood of payment.
    • 5. Setting
      • Given old data about customers and payments, predict new applicant’s loan eligibility.
      [Diagram: data on previous customers (Age, Salary, Profession, Location, Customer type) trains a classifier that outputs decision rules, e.g. Salary > 5 L and Prof. = Exec; the rules label a new applicant's data as good/bad.]
    • 6. Decision trees
      • Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
      [Diagram: example tree with tests Salary < 1 M, Prof = teaching, and Age < 30, and leaves labeled Good/Bad.]
    • 7. Topics to be covered
      • Tree construction :
        • Basic tree learning algorithm
        • Measures of predictive ability
        • High-performance decision tree construction: SPRINT
      • Tree pruning:
        • Why prune
        • Methods of pruning
      • Other issues:
        • Handling missing data
        • Continuous class labels
        • Effect of training size
    • 8. Tree learning algorithms
      • ID3 (Quinlan 1986)
      • Successor C4.5 (Quinlan 1993)
      • SLIQ (Mehta et al.)
      • SPRINT (Shafer et al.)
    • 9. Basic algorithm for tree building
      • Greedy top-down construction:
      Gen_Tree(node, data)
        if node should be made a leaf then stop;
        find the best attribute and the best split on that attribute (the selection criteria);
        partition the data on the split condition;
        for each child j of node: Gen_Tree(node_j, data_j);
    • 10. Split criteria
      • Select the attribute that is best for classification.
      • Intuitively pick one that best separates instances of different classes.
      • Quantifying the intuition: measure separability.
      • First define the impurity of an arbitrary set S consisting of K classes.
    • 11. Impurity Measures
      • Information entropy: Entropy(S) = - sum_j p_j log(p_j), where p_j is the fraction of S in class j
        • Zero when S consists of a single class; one (for two classes, with base-2 logs) when all classes occur in equal number
      • Another measure of impurity, the Gini index: Gini(S) = 1 - sum_j p_j^2
    • 12. Split criteria
      • K classes; a set S of instances is partitioned into r subsets. Subset S_j contains a fraction p_ij of instances of class i.
      • Information entropy of a subset: Entropy(S_j) = - sum_i p_ij log(p_ij)
      • Gini index of a subset: Gini(S_j) = 1 - sum_i p_ij^2
      [Plot: impurity vs. class fraction (0 to 1) for r = 1, k = 2, comparing entropy and Gini; the Gini axis is marked at 1/4.]
    • 13. Information gain
      • Information gain on partitioning S into r subsets:
      Gain = Impurity(S) - sum_{j=1..r} (|S_j| / |S|) * Impurity(S_j)
      • i.e., the impurity of S minus the weighted impurity of each subset
    • 14. Information gain: example
      S: K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
      E(S) = -0.6 log(0.6) - 0.4 log(0.4) = 0.29
      Partition into S1, S2:
      |S1| = 70, p1 = 0.8, p2 = 0.2; E(S1) = -0.8 log(0.8) - 0.2 log(0.2) = 0.21
      |S2| = 30, p1 = 0.13, p2 = 0.87; E(S2) = -0.13 log(0.13) - 0.87 log(0.87) = 0.16
      Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1
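      A minimal Python sketch of this computation; it follows the slide's base-10 logarithms, and the function names are illustrative, not from the original:

        import math

        def entropy(fractions):
            # Impurity of a set from its class fractions; base-10 logs, as on the slide.
            return -sum(p * math.log10(p) for p in fractions if p > 0)

        def information_gain(parent, subsets):
            # subsets: (weight, fractions) pairs, where weight = |Sj| / |S|.
            return entropy(parent) - sum(w * entropy(f) for w, f in subsets)

        gain = information_gain([0.6, 0.4], [(0.7, [0.8, 0.2]), (0.3, [0.13, 0.87])])
        print(round(entropy([0.6, 0.4]), 2), round(gain, 2))
        # prints 0.29 0.09; the slide's 0.1 comes from rounding intermediate entropies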
    • 15. Feature subset selection
      • Embedded: selection built into the mining algorithm
      • Filter: select features in advance, independent of the algorithm
      • Wrapper: generate candidate feature subsets and test them on the black-box mining algorithm
        • high cost, but provides better, algorithm-dependent features
    • 16. Meta learning methods
      • No single classifier is good in all cases
      • Difficult to evaluate in advance which conditions favor which classifier
      • Meta learning: combine the effects of several classifiers
        • Voting: sum up the votes of the component classifiers
        • Combiners: learn a new classifier on the outcomes of previous ones
        • Boosting: staged classifiers
      • Disadvantage: interpretation is hard
        • Knowledge probing: learn a single classifier to mimic the meta classifier
    • 17. SPRINT (Scalable PaRallelizable INduction of decision Trees)
      • Decision-tree classifier for data mining
      • Design goals:
        • Able to handle large disk-resident training sets
        • No restrictions on training-set size
        • Easily parallelizable
    • 18. Example
      • Example training data and the induced tree:
      [Diagram: tree with root test Age < 25 and a child test CarType in {sports}; leaves labeled High, High, Low.]
    • 19. Building tree
      GrowTree(TrainingData D)
        Partition(D);

      Partition(Data D)
        if (all points in D belong to the same class) then
          return;
        for each attribute A do
          evaluate splits on attribute A;
        use best split found to partition D into D1 and D2;
        Partition(D1);
        Partition(D2);
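      A compact, runnable Python version of this skeleton for in-memory data, using the Gini index (slide 22) as the selection criterion; a sketch only, not SPRINT's disk-based implementation, and all names are illustrative:

        def gini(labels):
            # Gini impurity of a list of class labels.
            n = len(labels)
            return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

        def best_split(data):
            # data: list of (feature_tuple, label); tries tests of the form x[a] < v.
            best = None
            for a in range(len(data[0][0])):
                for v in sorted({x[a] for x, _ in data}):
                    d1 = [(x, y) for x, y in data if x[a] < v]
                    d2 = [(x, y) for x, y in data if x[a] >= v]
                    if not d1 or not d2:
                        continue
                    score = (len(d1) * gini([y for _, y in d1]) +
                             len(d2) * gini([y for _, y in d2])) / len(data)
                    if best is None or score < best[0]:
                        best = (score, a, v, d1, d2)
            return best

        def partition(data):
            labels = [y for _, y in data]
            if len(set(labels)) == 1:          # all points in one class: make a leaf
                return labels[0]
            split = best_split(data)
            if split is None:                  # no useful split: majority-class leaf
                return max(set(labels), key=labels.count)
            _, a, v, d1, d2 = split
            return {'test': (a, v), 'left': partition(d1), 'right': partition(d2)}

        print(partition([((23,), 'High'), ((30,), 'Low'), ((40,), 'High')]))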
    • 20. Data Setup: Attribute Lists
      • One list for each attribute
      • Entries in an Attribute List consist of:
        • attribute value
        • class value
        • record id
      • Lists for continuous attributes are in sorted order
      • Lists may be disk-resident
      • Each leaf-node has its own set of attribute lists representing the training examples belonging to that leaf
      [Example list: entries of the form (attribute value, class value, record id).]
    • 21. Attribute Lists: Example
      [Initial attribute lists for the root node: one sorted list per attribute, with entries (value, class, rid).]
    • 22. Evaluating Split Points
      • Gini Index
        • if data D contains examples from c classes:
      Gini(D) = 1 - sum_j pj^2, where pj is the relative frequency of class j in D
      • If D is split into D1 and D2, with n1 and n2 tuples each out of n:
      Gini_split(D) = (n1/n) * Gini(D1) + (n2/n) * Gini(D2)
      • Note: only class frequencies are needed to compute the index
    • 23. Finding Split Points
      • For each attribute A do
        • evaluate splits on attribute A using attribute list
      • Keep split with lowest GINI index
    • 24. Finding Split Points: Continuous Attrib.
      • Consider splits of the form: value(A) < x
        • Example: Age < 17
      • Evaluate this split-form for every value in an attribute list
      • To evaluate splits on attribute A for a given tree-node:
      initialize class histogram of left child to zeroes;
      initialize class histogram of right child to same as its parent;
      for each record in the attribute list do
        evaluate splitting index for value(A) < record.value;
        using the class label of the record, update the class histograms;
    • 25. Finding Split Points: Continuous Attrib.
      [Diagram: scanning the sorted Age attribute list; at cursor positions 0, 1, 3, and 6, the candidate splits give GINI = undef (Age < 17), 0.4 (Age < 20), 0.222 (Age < 32), and undef, with the left/right class histograms updated as the cursor advances.]
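      A Python sketch of this scan; the six-record Age list below is an assumption chosen so that the GINI values match the diagram above (0.4 for Age < 20, 0.222 for Age < 32):

        def gini_from_hist(hist):
            # Gini impurity from a class-count histogram; None (undef) if empty.
            n = sum(hist.values())
            if n == 0:
                return None
            return 1.0 - sum((c / n) ** 2 for c in hist.values())

        def scan_continuous(attr_list, classes):
            # attr_list: (value, class, rid) triples in sorted order.
            left = {c: 0 for c in classes}              # left child starts empty
            right = {c: 0 for c in classes}             # right child starts as parent
            for _, cls, _ in attr_list:
                right[cls] += 1
            n, best = len(attr_list), None
            for value, cls, _ in attr_list:
                # Evaluate the split "attribute < value"; records seen so far go left.
                gl, gr = gini_from_hist(left), gini_from_hist(right)
                if gl is not None and gr is not None:
                    nl = sum(left.values())
                    g = (nl * gl + (n - nl) * gr) / n
                    if best is None or g < best[0]:
                        best = (g, value)
                left[cls] += 1                          # advance the cursor:
                right[cls] -= 1                         # record moves to the left side
            return best

        ages = [(17, 'High', 1), (20, 'High', 5), (23, 'High', 0),
                (32, 'Low', 4), (43, 'High', 2), (68, 'Low', 3)]
        print(scan_continuous(ages, {'High', 'Low'}))   # (0.222..., 32): best is Age < 32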
    • 26. Finding Split Points: Categorical Attrib.
      • Consider splits of the form: value(A) ∈ {x1, x2, ..., xn}
        • Example: CarType ∈ {family, sports}
      • Evaluate this split form for subsets of domain(A)
      • To evaluate splits on attribute A for a given tree node:
      initialize the class/value matrix of the node to zeroes;
      for each record in the attribute list do
        increment the appropriate count in the matrix;
      evaluate the splitting index for various subsets using the constructed matrix;
    • 27. Finding Split Points: Categorical Attrib.
      [Diagram: attribute list and class/value matrix for CarType, with GINI values for the candidate splits CarType in {family} (0.444), CarType in {truck} (0.267), and CarType in {sports} (0.333).]
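      A matching Python sketch for categorical attributes, using an exhaustive subset search (the greedy variant for large domains is slide 30); the CarType data is likewise an assumption that reproduces the GINI values above:

        from itertools import combinations

        def gini_counts(hist):
            # Gini impurity from a class-count histogram.
            n = sum(hist.values())
            return 1.0 - sum((c / n) ** 2 for c in hist.values())

        def categorical_splits(attr_list, classes):
            # Build the class/value matrix in one pass over the attribute list.
            matrix = {}
            for value, cls, _ in attr_list:
                matrix.setdefault(value, {c: 0 for c in classes})[cls] += 1
            values, n, results = list(matrix), len(attr_list), {}
            # Evaluate every proper subset of domain(A) using only the matrix.
            for r in range(1, len(values)):
                for subset in combinations(values, r):
                    left = {c: sum(matrix[v][c] for v in subset) for c in classes}
                    right = {c: sum(matrix[v][c] for v in values if v not in subset)
                             for c in classes}
                    nl, nr = sum(left.values()), sum(right.values())
                    results[subset] = (nl * gini_counts(left) +
                                       nr * gini_counts(right)) / n
            return results

        cars = [('family', 'High', 0), ('sports', 'High', 1), ('sports', 'High', 2),
                ('family', 'Low', 3), ('truck', 'Low', 4), ('family', 'High', 5)]
        for subset, g in sorted(categorical_splits(cars, {'High', 'Low'}).items()):
            print(subset, round(g, 3))  # {family}: 0.444, {sports}: 0.333, {truck}: 0.267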
    • 28. Performing the Splits
      • The attribute lists of every node must be divided among the two children
      • To split the attribute lists of a given node:
      for the list of the attribute used to split this node do
        use the split test to divide the records;
        collect the record ids;
      build a hash table from the collected ids;
      for the remaining attribute lists do
        use the hash table to divide each list;
      build class histograms for each new leaf;
    • 29. Performing the Splits: Example
      [Example: splitting on Age < 32; the hash table maps record ids 0 → Left, 1 → Left, 2 → Right, 3 → Right, 4 → Right, 5 → Left.]
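      A Python sketch of the two-pass split, reusing the hypothetical Age and CarType lists from the sketches above; with the test Age < 32 the hash table matches the example:

        ages = [(17, 'High', 1), (20, 'High', 5), (23, 'High', 0),
                (32, 'Low', 4), (43, 'High', 2), (68, 'Low', 3)]
        cars = [('family', 'High', 0), ('sports', 'High', 1), ('sports', 'High', 2),
                ('family', 'Low', 3), ('truck', 'Low', 4), ('family', 'High', 5)]

        def split_attribute_lists(lists, split_attr, test):
            # lists: {attr_name: [(value, class, rid), ...]}; test(value) -> goes left?
            # Pass 1: divide the splitting attribute's list, collecting rids.
            table = {rid: ('Left' if test(v) else 'Right')
                     for v, _, rid in lists[split_attr]}
            # Pass 2: probe the hash table to divide every list, preserving sort order.
            left = {a: [e for e in l if table[e[2]] == 'Left'] for a, l in lists.items()}
            right = {a: [e for e in l if table[e[2]] == 'Right'] for a, l in lists.items()}
            return left, right, table

        _, _, table = split_attribute_lists({'Age': ages, 'CarType': cars},
                                            'Age', lambda v: v < 32)
        print(sorted(table.items()))
        # [(0, 'Left'), (1, 'Left'), (2, 'Right'), (3, 'Right'), (4, 'Right'), (5, 'Left')]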
    • 30. SPRINT: summary
      • Each node of the decision tree requires examining possible splits on each value of each attribute.
      • After choosing a split attribute, all the data must be partitioned into the resulting subsets.
      • Need to make this search efficient.
      • Evaluating splits on numeric attributes:
        • Sort on attribute value, incrementally evaluate gini
      • Splits on categorical attributes
        • For each subset, find gini and choose the best
        • For large sets, use greedy method
    • 31. Preventing overfitting
      • A tree T overfits if there is another tree T’ that gives higher error on the training data yet gives lower error on unseen data.
      • An overfitted tree does not generalize to unseen instances.
      • Happens when data contains noise or irrelevant attributes and training size is small.
      • Overfitting can reduce accuracy drastically:
        • (10-25%, as reported in Mingers' 1989 Machine Learning study)
      • Occam’s razor: prefer the simplest hypothesis that fits the data
    • 32. Approaches to prevent overfitting
      • Stop growing the tree beyond a certain point
      • First over-fit, then post prune. (More widely used)
        • Tree building divided into phases:
          • Growth phase
          • Prune phase
      • Hard to decide when to stop growing the tree, so the second approach is more widely used.
    • 33. Criteria for finding correct final tree size:
      • Cross validation with separate test data
      • Use all data for training, but apply a statistical test to decide the right size.
      • Use some criterion function to choose the best size
          • Example: minimum description length (MDL) criterion
      • Cross validation approach:
        • Partition the dataset into two disjoint parts:
          • 1. Training set used for building the tree.
          • 2. Validation set used for pruning the tree
        • Build the tree using the training-set.
        • Evaluate the tree on the validation set; at each leaf and internal node, keep a count of correctly labeled data.
        • Starting bottom-up, prune a node if its error (as a leaf) is less than that of its children.
    • 34. Cross validation..
      • Need a large validation set to smooth out the over-fitting of training data. Rule of thumb: one-third.
      • What if the training data set is limited in size?
        • Generate many different partitions of the data.
        • n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn.
        • Train n classifiers, with D - Di as training data and Di as test data.
        • Average the results, as sketched below.
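      A minimal sketch of n-fold cross validation; train and error are hypothetical callables standing in for any classifier's training and evaluation routines:

        def n_fold_cv(data, n, train, error):
            # Partition the data into n parts (striped, for simplicity).
            folds = [data[i::n] for i in range(n)]
            errors = []
            for i in range(n):
                # Train on D - Di, test on Di.
                training = [x for j, f in enumerate(folds) if j != i for x in f]
                model = train(training)
                errors.append(error(model, folds[i]))
            return sum(errors) / n   # average over the n runs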
    • 35. Rule-based pruning
      • Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned as well.
      • Rule-based: for each leaf of the tree, extract a rule using a conjunction of all tests up to the root.
      • On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
      • Sort rules by decreasing accuracy.
    • 36. MDL-based pruning
      • Idea: a branch of the tree is over-fitted if the training examples that fall under it can be explicitly enumerated (with their classes) in less space than the branch itself occupies
      • Prune branch if over-fitted
        • philosophy: use tree that minimizes description length of training data
    • 37. Regression trees
      • Decision trees with continuous class labels.
      • Regression trees approximate the function with piecewise-constant regions.
      • Split criteria for regression trees (sketched below):
        • Predicted value for a set S = the average of all values in S
        • Error: sum of the squared error of each member of S from the predicted average
        • Pick the split with the smallest error.
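      A sketch of this split criterion for one numeric attribute (all names are illustrative):

        def sse(ys):
            # Sum of squared errors from the predicted value (the average of S).
            mean = sum(ys) / len(ys)
            return sum((y - mean) ** 2 for y in ys)

        def best_regression_split(points):
            # points: (x, y) pairs; candidate tests are of the form x < v.
            best = None
            for v in sorted({x for x, _ in points}):
                left = [y for x, y in points if x < v]
                right = [y for x, y in points if x >= v]
                if left and right:
                    err = sse(left) + sse(right)
                    if best is None or err < best[0]:
                        best = (err, v)
            return best   # (total squared error, split value)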
    • 38. Issues
      • Multiple splits on continuous attributes [Fayyad 93, Multi-interval discretization of continuous attributes]
      • Multi attribute tests on nodes to handle correlated attributes
        • multivariate linear splits [Oblique trees, Murthy 94]
      • Methods of handling missing values
        • assume majority value
        • take most probable path
      • Allowing varying costs for different attributes
    • 39. Pros and Cons of decision trees
      • Cons
      • Cannot handle complicated relationships between features
      • Simple decision boundaries
      • Problems with lots of missing data
      • Pros
      • Reasonable training time
      • Fast application
      • Easy to interpret
      • Easy to implement
      • Can handle a large number of features
      More information: http://www.recursive-partitioning.com/
    • 40. Clustering or Unsupervised learning
    • 41. Distance functions
      • Numeric data: Euclidean, Manhattan distances
        • Minkowski metric: [sum |xi - yi|^m]^(1/m)
        • Larger m gives higher weight to larger distances
      • Categorical data: 0/1 to indicate presence/absence
        • Euclidean distance: equal weightage to 1-1 and 0-0 matches
        • Hamming distance (# of dissimilarities)
        • Jaccard coefficient: # of matching 1s / # of positions with a 1 (0-0 matches are not important)
      • Combined numeric and categorical data: weighted normalized distance. Sketches of these measures follow.
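      The measures above, sketched in Python; the Jaccard form shown is one common variant:

        def minkowski(x, y, m):
            # [sum |xi - yi|^m]^(1/m); m = 1 is Manhattan, m = 2 is Euclidean.
            return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1 / m)

        def hamming(x, y):
            # Number of dissimilar positions in 0/1 vectors.
            return sum(a != b for a, b in zip(x, y))

        def jaccard(x, y):
            # Matching 1s over positions where either vector has a 1; 0-0 ignored.
            both = sum(a and b for a, b in zip(x, y))
            any_ = sum(a or b for a, b in zip(x, y))
            return both / any_ if any_ else 1.0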
    • 42. Distance functions on high dimensional data
      • Example: Time series, Text, Images
      • Euclidean measures make all points appear equally far apart
      • Reduce number of dimensions:
        • choose subset of original features using random projections, feature selection techniques
        • transform original features using statistical methods like Principal Component Analysis
      • Define domain specific similarity measures: e.g. for images define features like number of objects, color histogram; for time series define shape based measures.
      • Define non-distance-based (model-based) clustering methods.
    • 43. Clustering methods
      • Hierarchical clustering
        • agglomerative vs. divisive
        • single link vs. complete link
      • Partitional clustering
        • distance-based: K-means
        • model-based: EM
        • density-based
    • 44. Agglomerative Hierarchical clustering
      • Given: matrix of similarity between every point pair
      • Start with each point in a separate cluster and merge clusters based on some criterion:
        • Single link: merge the two clusters for which the minimum distance between a point in one and a point in the other is smallest
        • Complete link: merge two clusters such that all points in one cluster are “close” to all points in the other (smallest maximum distance)
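      A naive O(n^3) Python sketch of agglomerative merging under either criterion; dist is any user-supplied point-pair distance, and the whole routine is illustrative only:

        def agglomerate(dist, n, k, link='single'):
            # Start with each of the n points in its own cluster; merge down to k.
            clusters = [[i] for i in range(n)]
            agg = min if link == 'single' else max   # single vs. complete link
            while len(clusters) > k:
                best = None
                for i in range(len(clusters)):
                    for j in range(i + 1, len(clusters)):
                        # Inter-cluster distance under the chosen linkage.
                        d = agg(dist(p, q) for p in clusters[i] for q in clusters[j])
                        if best is None or d < best[0]:
                            best = (d, i, j)
                _, i, j = best
                clusters[i] += clusters.pop(j)       # merge the closest pair
            return clusters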
    • 45. Pros and Cons
      • Single link:
        • confused by near overlap
        • chaining effect
      • Complete link
        • unnecessary splits of elongated point clouds
        • sensitive to outliers
      • Several other hierarchical methods are known.
    • 46. Partitional methods: K-means
      • Criterion: minimize the sum of squared distances
          • between each point and the centroid of its cluster, or
          • between each pair of points in the cluster
      • Algorithm:
        • Select initial partition with K clusters: random, first K, K separated points
        • Repeat until stabilization:
          • Assign each point to closest cluster center
          • Generate new cluster centers
          • Adjust clusters by merging/splitting
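      A minimal K-means sketch, assuming first-K initialization and omitting the merge/split adjustment step:

        def k_means(points, k, iters=100):
            # points: list of numeric tuples; initialize centers to the first K points.
            centers = [list(p) for p in points[:k]]
            clusters = [[] for _ in range(k)]
            for _ in range(iters):
                # Assign each point to the closest cluster center.
                clusters = [[] for _ in range(k)]
                for p in points:
                    j = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
                    clusters[j].append(p)
                # Generate new cluster centers (centroids); stop when stable.
                new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                       for i, cl in enumerate(clusters)]
                if new == centers:
                    break
                centers = new
            return centers, clusters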
    • 47. Properties
      • May not reach global optima
      • Converges fast in practice: guaranteed for certain forms of optimization function
      • Complexity: O(KndI):
        • I number of iterations, n number of points, d number of dimensions, K number of clusters.
      • Database research on scalable algorithms:
        • Birch: one/two pass of data by keeping R-tree like index in memory [Sigmod 96]
    • 48. Model based clustering
      • Assume the data is generated from K probability distributions
      • Typically Gaussian distributions
      • Soft or probabilistic version of K-means clustering
      • Need to find the distribution parameters.
      • EM Algorithm
    • 49. EM Algorithm
      • Initialize K cluster centers
      • Iterate between two steps
        • Expectation step: (probabilistically) assign points to clusters
        • Maximization step: estimate model parameters
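      A sketch of EM for a one-dimensional Gaussian mixture, the soft version of K-means from the previous slide; the initialization and numerical guards here are simplistic assumptions:

        import math

        def em_gmm_1d(xs, k, iters=50):
            # Spread initial means across the data range; unit variances, equal weights.
            lo, hi = min(xs), max(xs)
            mu = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
            var, w = [1.0] * k, [1.0 / k] * k
            for _ in range(iters):
                # E step: posterior probability of each cluster for each point.
                resp = []
                for x in xs:
                    ps = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) /
                          math.sqrt(2 * math.pi * var[j]) for j in range(k)]
                    s = sum(ps) or 1e-300            # guard against underflow
                    resp.append([p / s for p in ps])
                # M step: re-estimate weights, means, variances from the posteriors.
                for j in range(k):
                    nj = sum(r[j] for r in resp) + 1e-12   # guard: empty cluster
                    w[j] = nj / len(xs)
                    mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
                    var[j] = max(1e-6, sum(r[j] * (x - mu[j]) ** 2
                                           for r, x in zip(resp, xs)) / nj)
            return w, mu, var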
    • 50. Properties
      • May not reach global optima
      • Converges fast in practice: guaranteed for certain forms of optimization function
      • Complexity: O(KndI):
        • I number of iterations, n number of points, d number of dimensions, K number of clusters.
    • 51. Scalable clustering algorithms
      • Birch: one/two pass of data by keeping R-tree like index in memory [Sigmod 96]
      • Fayyad and Bradley: sample repetitively and update summaries of clusters stored in memory (K-means and EM) [KDD 98]
      • Dasgupta 99: Recent theoretical breakthrough, find Gaussian clusters with guaranteed performance
        • Random projections
    • 52. Data Mining Challenges
      • Large databases
        • sampling, parallel processing, approximation
      • High dimensionality
        • feature selection, transformation
      • Overfitting
      • Data evolution
      • Missing/noisy data
      • Non-obvious patterns
      • User interaction/prior knowledge
      • Integration with other systems
    • 53. To Learn More
    • 54. Books
      • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999
      • Usama Fayyad et al. (eds), Advances in Knowledge Discovery and Data Mining , AAAI/MIT Press, 1996
      • Tom Mitchell, Machine Learning , McGraw-Hill
    • 55. Software
      • Public domain
        • Weka 3: data mining algos in Java (http://www.cs.waikato.ac.nz/~ml/weka)
          • classification, regression
        • MLC++: data mining tools in C++
          • mainly classification
      • Free for universities
        • try convincing IBM to give it free!
      • Datasets: follow links from www.kdnuggets.com to UC Irvine site
    • 56. Resources
      • http://www.kdnuggets.com
        • Great site with links to software, datasets etc. Be sure to visit it.
      • http://www.cs.bham.ac.uk/~anp/TheDataMine.html
      • OLAP: http://altaplana.com/olap/
      • SIGKDD: http://www.acm.org/sigkdd
      • Data mining and knowledge discovery journal: http://www.research.microsoft.com/research/datamine/
      • Communications of ACM Special Issue on Data Mining, Nov 1996
    • 57. Resources at IITB
      • http://www.cse.iitb.ernet.in/~dbms
        • IITB DB group home page
      • http://www.it.iitb.ernet.in/~sunita/it642
        • Data Warehousing and Data Mining course offered by Prof. Sunita Sarawagi at IITB
