  1. 1. CSCI 548/B480: Introduction to Bioinformatics Fall 2002 Dr. Jeffrey Huang, Assistant Professor Department of Computer and Information Science, IUPUI E-mail: Topic 5: Machine Intelligence - Learning and Evolution
  2. 2. Machine Intelligence <ul><li>Machine Learning </li></ul><ul><ul><li>The subfield of AI concerned with intelligent systems that learn. </li></ul></ul><ul><ul><li>The computational study of algorithms that improve performance based on experience. </li></ul></ul><ul><li>The attempt to build intelligent entities: </li></ul><ul><ul><li>We must understand intelligent entities first </li></ul></ul><ul><ul><li>Computational Brain </li></ul></ul><ul><ul><li>Mathematics: </li></ul></ul><ul><ul><ul><li>Philosophy staked most of the ideas of AI but to make it a formal science the mathematical formalization is needed in </li></ul></ul></ul><ul><ul><ul><ul><li>Computation </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Logic </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Probability </li></ul></ul></ul></ul>
  3. 3. Behavior-Based AI vs. Knowledge-Based <ul><li>Definitions of Machine Learning </li></ul><ul><ul><li>Reasoning </li></ul></ul><ul><ul><ul><li>The effort to make computers think and solve problems </li></ul></ul></ul><ul><ul><ul><li>The study of mental faculties through the use of computational models </li></ul></ul></ul><ul><ul><li>Behavior </li></ul></ul><ul><ul><ul><li>Make machines perform human actions that require intelligence </li></ul></ul></ul><ul><ul><ul><li>Seeks to explain intelligent behavior in terms of computational processes </li></ul></ul></ul><ul><li>Agents </li></ul><ul><ul><li>[Figure: an agent receives percepts from the environment through sensors and acts on the environment through effectors] </li></ul></ul>
  4. 4. Operational Agents <ul><li>Operational Views of Intelligence: </li></ul><ul><ul><li>The ability to perform intellectual tasks </li></ul></ul><ul><ul><ul><li>Prove theorems, play chess, solve puzzle </li></ul></ul></ul><ul><ul><ul><li>Focus on what goes on “between the ears” </li></ul></ul></ul><ul><ul><ul><li>Emphasize the ability to build and effectively use mental models </li></ul></ul></ul><ul><ul><li>The ability to perform intellectually challenging “real world” tasks </li></ul></ul><ul><ul><ul><li>Medical diagnosis, tax advising, financial investing </li></ul></ul></ul><ul><ul><ul><li>Introduce new issues such as: critical interactions with the world, model grounding, uncertainty </li></ul></ul></ul><ul><ul><li>The ability to survive, adapt, and function in a constantly changing world </li></ul></ul><ul><ul><ul><li>Autonomous agents </li></ul></ul></ul><ul><ul><ul><li>Vision, locomotion, and manipulation,… many I/O issues </li></ul></ul></ul><ul><ul><ul><li>Self-assessment, learning, curiosity, etc. </li></ul></ul></ul>
  5. 5. Building Intelligent Artifacts <ul><li>Symbolic Approaches: </li></ul><ul><ul><li>Construct goal-oriented symbol manipulation systems </li></ul></ul><ul><ul><li>Focus on high end abstract thinking </li></ul></ul><ul><li>Non-symbolic approaches: </li></ul><ul><ul><li>Build performance-oriented systems </li></ul></ul><ul><ul><li>Focus on behavior </li></ul></ul><ul><li>Need both in tightly coupled form </li></ul><ul><ul><li>Difficult in building such systems </li></ul></ul><ul><ul><li>Growing need to automate this process </li></ul></ul><ul><ul><li>Good approach: Evolutionary Algorithms </li></ul></ul>
  6. 6. <ul><li>Behavior-Based AI </li></ul><ul><ul><li>Behavior-Based AI vs. Knowledge-Based </li></ul></ul><ul><ul><li>&quot;Situated&quot; in environment </li></ul></ul><ul><ul><li>Multiple competencies ('routines') </li></ul></ul><ul><ul><li>Autonomy </li></ul></ul><ul><ul><li>Adaptation and Competition </li></ul></ul><ul><li>Artificial Life (A-Life) </li></ul><ul><ul><li>Agents: Reactive Behavior </li></ul></ul><ul><ul><li>Abstracting the logical principles of living organism </li></ul></ul><ul><ul><li>Collective Behavior : Competition and Cooperation </li></ul></ul>
  7. 7. Classification vs. Prediction <ul><li>Classification: </li></ul><ul><ul><li>predicts categorical class labels </li></ul></ul><ul><ul><li>classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data </li></ul></ul><ul><li>Prediction: </li></ul><ul><ul><li>models continuous-valued functions, i.e., predicts unknown or missing values </li></ul></ul>
  8. 8. Classification: A Two-Step Process <ul><li>Model construction: describing a set of predetermined classes </li></ul><ul><ul><li>Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute </li></ul></ul><ul><ul><li>The set of tuples used for model construction: training set </li></ul></ul><ul><ul><li>The model is represented as classification rules, decision trees, or mathematical formulae </li></ul></ul><ul><li>Model usage: for classifying future or unknown objects </li></ul><ul><ul><li>Estimate accuracy of the model </li></ul></ul><ul><ul><ul><li>The known label of each test sample is compared with the classified result from the model </li></ul></ul></ul><ul><ul><ul><li>Accuracy rate is the percentage of test set samples that are correctly classified by the model </li></ul></ul></ul><ul><ul><ul><li>The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected </li></ul></ul></ul>
  9. 9. Classification Process Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ Model Construction Use the Model in Prediction (Jeff, Professor, 2) Tenured? Training Data Classifier (Model) Testing Data Unseen Data Classifier (Model)
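The model-usage step on this slide can be sketched in a few lines of Python. The rule is the one shown on the slide; the labeled test samples below are hypothetical and only illustrate how an accuracy rate would be computed on data held out from training.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Apply the slide's rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(test_set) -> float:
    """Fraction of test samples whose known label matches the model's output."""
    correct = sum(predict_tenured(rank, years) == label
                  for _name, rank, years, label in test_set)
    return correct / len(test_set)


# Hypothetical test samples: (name, rank, years, known label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

print(predict_tenured("Professor", 2))   # the unseen sample (Jeff, Professor, 2)
print(accuracy(test_set))
```

Here the rule misclassifies one of the four hypothetical samples, which is the kind of error the accuracy estimate is meant to expose.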
  10. 10. Supervised vs. Unsupervised Learning <ul><li>Supervised learning (classification) </li></ul><ul><ul><li>Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations </li></ul></ul><ul><ul><li>New data is classified based on the training set </li></ul></ul><ul><li>Unsupervised learning (clustering) </li></ul><ul><ul><li>The class labels of the training data are unknown </li></ul></ul><ul><ul><li>Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data </li></ul></ul>
  11. 11. Classification and Prediction <ul><li>Data Preparation </li></ul><ul><ul><li>Data cleaning </li></ul></ul><ul><ul><ul><li>Preprocess data in order to reduce noise and handle missing values </li></ul></ul></ul><ul><ul><li>Relevance analysis (feature selection) </li></ul></ul><ul><ul><ul><li>Remove the irrelevant or redundant attributes </li></ul></ul></ul><ul><ul><li>Data transformation </li></ul></ul><ul><ul><ul><li>Generalize and/or normalize data </li></ul></ul></ul><ul><li>Evaluating Classification Methods </li></ul><ul><ul><li>Predictive accuracy </li></ul></ul><ul><ul><li>Speed and scalability </li></ul></ul><ul><ul><ul><li>time to construct the model </li></ul></ul></ul><ul><ul><ul><li>time to use the model </li></ul></ul></ul><ul><ul><li>Robustness : handling noise and missing values </li></ul></ul><ul><ul><li>Scalability : efficiency in disk-resident databases </li></ul></ul><ul><ul><li>Interpretability : understanding and insight provided by the model </li></ul></ul><ul><ul><li>Goodness of rules </li></ul></ul><ul><ul><ul><li>decision tree size </li></ul></ul></ul><ul><ul><ul><li>compactness of classification rules </li></ul></ul></ul>
  12. 12. From Learning to Evolutionary <ul><li>Optimization </li></ul><ul><ul><li>Accomplishing abstract task = Solving problem </li></ul></ul><ul><ul><li>= searching through a space of potential solutions </li></ul></ul><ul><ul><li>finding the “best solution” </li></ul></ul><ul><ul><ul><li>⇒ an optimization process </li></ul></ul></ul><ul><ul><li>Classical exhaustive methods? </li></ul></ul><ul><ul><li>Large space? It calls for special machine learning techniques </li></ul></ul><ul><li>Evolutionary Algorithms </li></ul><ul><ul><li>Stochastic algorithms </li></ul></ul><ul><ul><li>Search methods that model natural phenomena: </li></ul></ul><ul><ul><ul><li>Genetic inheritance </li></ul></ul></ul><ul><ul><ul><li>Darwinian struggle for survival </li></ul></ul></ul>
  13. 13. <ul><li>“… the metaphor underlying genetic algorithms is that of natural evolution. In evolution, the problem each species faces is one of searching for beneficial adaptations to a complicated and changing environment. The ‘knowledge’ that each species has gained is embodied in the makeup of chromosomes of its members” </li></ul><ul><ul><li>- L. Davis and M. Steenstrup, “Genetic Algorithms and Simulated Annealing”, pp. 1-11, Morgan Kaufmann, 1987 </li></ul></ul>
  14. 14. The Essential Components <ul><ul><li>A genetic representation for potential solutions to the problem </li></ul></ul><ul><ul><li>A way to create an initial population of potential solutions </li></ul></ul><ul><ul><li>An evaluation function that plays the role of the environment, rating solutions in terms of their “fitness” </li></ul></ul><ul><ul><ul><li>i.e., the use of fitness to determine survival and reproductive rates </li></ul></ul></ul><ul><ul><li>Genetic operators that alter the composition of children </li></ul></ul>
  15. 15. Evolutionary Algorithm Search Procedure: (1) Randomly generate an initial population M(0). (2) Compute and save the fitness u(m) for each individual m in the current population M(t). (3) Define selection probabilities p(m) for each individual m in M(t) so that p(m) is proportional to u(m). (4) Generate M(t+1) by probabilistically selecting individuals to produce offspring via genetic operations (crossover and mutation), then repeat from step 2.
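The search procedure on this slide can be sketched as a small generic loop. This is a minimal sketch, not the course's implementation: the function names, the seed, the number of generations, and the toy "count the 1 bits" problem are all assumptions made for illustration, and fitness values are assumed to be non-negative so they can serve as selection weights.

```python
import random


def evolve(fitness, random_individual, crossover, mutate,
           pop_size=24, generations=50, seed=0):
    """Generic evolutionary search; assumes fitness values are non-negative."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]  # M(0)
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(m) for m in population]        # u(m) for each m in M(t)
        # p(m) proportional to u(m); fall back to uniform if every score is zero.
        weights = scores if sum(scores) > 0 else None
        offspring = []
        while len(offspring) < pop_size:
            mother, father = rng.choices(population, weights=weights, k=2)
            offspring.append(mutate(crossover(mother, father, rng), rng))
        population = offspring                            # M(t+1)
        best = max(population + [best], key=fitness)      # best-so-far individual
    return best


# Toy problem (an assumption for illustration): maximize the number of 1 bits.
N_BITS = 14


def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(N_BITS)]


def one_point(a, b, rng):
    cut = rng.randrange(1, N_BITS)
    return a[:cut] + b[cut:]


def flip_bits(bits, rng, p_m=0.01):
    return [1 - bit if rng.random() < p_m else bit for bit in bits]


best = evolve(sum, random_bits, one_point, flip_bits)
print(best, sum(best))
```

Passing the representation, evaluation function, and operators in as arguments mirrors the list of essential components: only those pieces change from one problem to the next.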
  16. 16. Historical Background <ul><li>Three paradigms emerged in the 1960s: </li></ul><ul><ul><li>Genetic Algorithms </li></ul></ul><ul><ul><ul><li>Introduced by Holland (University of Michigan) → De Jong (GMU) </li></ul></ul></ul><ul><ul><ul><li>Envisioned for a broad range of “adaptive systems” </li></ul></ul></ul><ul><ul><li>Evolution Strategies </li></ul></ul><ul><ul><ul><li>Introduced by Rechenberg </li></ul></ul></ul><ul><ul><ul><li>Focused on real-valued parameter optimization </li></ul></ul></ul><ul><ul><li>Evolutionary Programming </li></ul></ul><ul><ul><ul><li>Introduced by Fogel (Koza later developed the related genetic programming) </li></ul></ul></ul><ul><ul><ul><li>Applied to AI and machine learning problems </li></ul></ul></ul><ul><li>Today: </li></ul><ul><ul><li>Wide variety of evolutionary algorithms </li></ul></ul><ul><ul><li>Applied to many areas of science and engineering </li></ul></ul>
  17. 17. Examples of Evolutionary AI <ul><li>Parameter Tuning </li></ul><ul><ul><li>Pervasiveness of parameterized models </li></ul></ul><ul><ul><li>Complex behavioral changes due to non-linear interactions </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>Weights of an artificial neural network </li></ul></ul></ul><ul><ul><ul><li>Parameters of a heuristic evaluation function </li></ul></ul></ul><ul><ul><ul><li>Parameters of a rule induction system </li></ul></ul></ul><ul><ul><ul><li>Parameters of membership functions </li></ul></ul></ul><ul><ul><li>Goal: evolve over time a useful set of discrete/continuous parameters </li></ul></ul>
  18. 18. <ul><li>Evolving Structure </li></ul><ul><ul><li>Effect behavior change via more complex structures </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>Selecting/constructing the topology of ANNs </li></ul></ul></ul><ul><ul><ul><li>Selecting/constructing the feature sets </li></ul></ul></ul><ul><ul><ul><li>Selecting/constructing plans/scenarios </li></ul></ul></ul><ul><ul><ul><li>Selecting/constructing membership functions </li></ul></ul></ul><ul><ul><li>Goal: evolve useful structure over time </li></ul></ul><ul><li>Evolving Programs </li></ul><ul><ul><li>Goal: acquire new behaviors and adapt existing ones </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>Acquire/adapt behavioral rules sets </li></ul></ul></ul><ul><ul><ul><li>Acquire/adapt arm/joint control programs </li></ul></ul></ul><ul><ul><ul><li>Acquire/adapt task-oriented programming code </li></ul></ul></ul>
  19. 19. How Does a Genetic Algorithm Work? <ul><li>A simple example of function optimization </li></ul><ul><ul><li>Find $\max f(x) = x^2$ for $x \in [0, 4]$ </li></ul></ul><ul><ul><li>Representation: </li></ul></ul><ul><ul><ul><li>Genotype (chromosome): internally, points in the search space are represented as (binary) strings over some alphabet </li></ul></ul></ul><ul><ul><ul><li>Phenotype: the expressed traits of an individual </li></ul></ul></ul><ul><ul><ul><li>Dividing $[0, 4]$ into at least 10,000 steps (a step size of at most $4 \times 10^{-4}$) needs 14 bits </li></ul></ul></ul><ul><ul><ul><ul><li>$8{,}192 = 2^{13} < 10{,}000 < 2^{14} = 16{,}384$ </li></ul></ul></ul></ul><ul><ul><ul><li>Simple fixed-length binary encoding </li></ul></ul></ul><ul><ul><ul><ul><li>Assign 0.0 to the string 00 0000 0000 0000 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Assign $0.0 + \mathrm{bin2dec}(\text{binary string}) \cdot 4 / (2^{14} - 1)$ to the string 00 0000 0000 0001, and so on </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Phenotype 4.0 = genotype 11 1111 1111 1111 </li></ul></ul></ul></ul>
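The genotype-to-phenotype mapping on this slide can be written out as a short sketch. The names `decode`, `fitness`, `N_BITS`, `X_MIN`, and `X_MAX` are illustrative and not from the slides; `int(s, 2)` plays the role of `bin2dec`.

```python
N_BITS = 14
X_MIN, X_MAX = 0.0, 4.0


def decode(chromosome: str) -> float:
    """Map a 14-bit binary string (genotype) to x in [0, 4] (phenotype)."""
    bits = chromosome.replace(" ", "")   # allow grouped strings like '00 0000 0000 0001'
    if len(bits) != N_BITS or set(bits) - {"0", "1"}:
        raise ValueError("expected a 14-bit binary string")
    return X_MIN + int(bits, 2) * (X_MAX - X_MIN) / (2 ** N_BITS - 1)


def fitness(chromosome: str) -> float:
    """Evaluate f(x) = x^2 on the decoded phenotype."""
    return decode(chromosome) ** 2


print(decode("00 0000 0000 0000"))   # lower end of the interval
print(decode("00 0000 0000 0001"))   # one step above the lower end
print(decode("11 1111 1111 1111"))   # upper end of the interval
```

With 14 bits the step between neighbouring phenotypes is 4/(2^14 - 1), roughly 2.4 × 10^-4.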
  20. 20. <ul><ul><li>Initial population: </li></ul></ul><ul><ul><ul><li>Create a population (pop_size) of chromosomes, where each chromosome is a binary vector of 14 bits </li></ul></ul></ul><ul><ul><ul><li>All 14 bits for each chromosome are initialized randomly </li></ul></ul></ul><ul><ul><li>Evaluation function </li></ul></ul><ul><ul><ul><li>The evaluation function eval for a binary vector $v$ is equal to the function $f$ applied to the decoded value $x$: </li></ul></ul></ul><ul><ul><ul><li>$\mathrm{eval}(v) = f(x)$ </li></ul></ul></ul><ul><ul><ul><li>e.g., $\mathrm{eval}(v_1) = f(x_1) = \mathrm{fitness}_1$ </li></ul></ul></ul><ul><ul><li>Genotype → phenotype: 00000000000000 → 0.0; 00000000000001 → $4/(2^{14} - 1)$; …; 11111111111111 → 4.0 </li></ul></ul><ul><ul><li>[Figure: population of chromosomes $v_1, \dots, v_{24}$] </li></ul></ul>
  21. 21. <ul><ul><li>Parameters </li></ul></ul><ul><ul><ul><li>pop_size = 24 </li></ul></ul></ul><ul><ul><ul><li>Probability of crossover, $p_c = 0.6$ </li></ul></ul></ul><ul><ul><ul><li>Probability of mutation, $p_m = 0.01$ </li></ul></ul></ul><ul><ul><li>Recombination: using genetic operations </li></ul></ul><ul><ul><ul><li>Crossover ($p_c$), here with the crossover point after bit 4 </li></ul></ul></ul><ul><ul><ul><li>$v_1$ = 0111 1100010011 => $v_1'$ = 0111 0101011100 </li></ul></ul></ul><ul><ul><ul><li>$v_2$ = 0001 0101011100 => $v_2'$ = 0001 1100010011 </li></ul></ul></ul><ul><ul><ul><li>Mutation ($p_m$), here flipping bit 7 </li></ul></ul></ul><ul><ul><ul><li>$v_2'$ = 000111 0 0010011 => $v_2''$ = 000111 1 0010011 </li></ul></ul></ul>
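The two operators can be sketched as follows and checked against the bit strings on this slide. The function names are illustrative; the crossover point (after bit 4) and the mutated position (bit 7, index 6 when counting from zero) are read off the slide's example.

```python
import random


def one_point_crossover(a: str, b: str, point: int):
    """Swap the tails of two equal-length chromosomes after `point` bits."""
    return a[:point] + b[point:], b[:point] + a[point:]


def flip_bit(chromosome: str, index: int) -> str:
    """Flip the bit at a 0-based index."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]


def mutate(chromosome: str, p_m: float, rng: random.Random) -> str:
    """Flip each bit independently with probability p_m."""
    for i in range(len(chromosome)):
        if rng.random() < p_m:
            chromosome = flip_bit(chromosome, i)
    return chromosome


v1, v2 = "01111100010011", "00010101011100"
v1_new, v2_new = one_point_crossover(v1, v2, point=4)
print(v1_new, v2_new)
print(flip_bit(v2_new, 6))                       # the slide's single-bit mutation
print(mutate(v2_new, 0.01, random.Random(0)))    # random mutation with p_m = 0.01
```

In a full run, a pair would undergo crossover only with probability $p_c$ and the crossover point would be drawn at random; both are fixed here to reproduce the example.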
  22. 22. <ul><ul><li>Selection of M(t+1) from M(t): using a roulette wheel </li></ul></ul><ul><ul><ul><li>Total fitness of the population: $F = \sum_{i=1}^{pop\_size} \mathrm{eval}(v_i)$ </li></ul></ul></ul><ul><ul><ul><li>Probability of selection $prob_i$ for each chromosome $v_i$: $prob_i = \mathrm{eval}(v_i) / F$ </li></ul></ul></ul><ul><ul><ul><li>Cumulative probability $q_i = \sum_{j=1}^{i} prob_j$, with $q_0 = 0$ </li></ul></ul></ul><ul><ul><ul><li>Generate random numbers $r_j$ from [0,1], where $j = 1, \dots, pop\_size$ </li></ul></ul></ul><ul><ul><ul><li>Select chromosome $v_i$ such that $q_{i-1} < r_j \le q_i$ </li></ul></ul></ul>
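Roulette-wheel selection as outlined on this slide can be sketched with cumulative probabilities and a binary search. The function name and the toy population are assumptions; fitness values are assumed to be non-negative with a positive total.

```python
import bisect
import itertools
import random


def roulette_select(population, fitnesses, rng):
    """Draw len(population) individuals with probability proportional to fitness."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    probs = [f / total for f in fitnesses]
    cumulative = list(itertools.accumulate(probs))   # q_1, ..., q_n
    cumulative[-1] = 1.0                             # guard against round-off
    selected = []
    for _ in range(len(population)):
        r = 1.0 - rng.random()                       # r lies in (0, 1]
        # Smallest i with q_i >= r, i.e. q_{i-1} < r <= q_i.
        i = bisect.bisect_left(cumulative, r)
        selected.append(population[i])
    return selected


rng = random.Random(42)
population = ["v1", "v2", "v3", "v4"]
fitnesses = [1.0, 4.0, 9.0, 16.0]    # e.g. f(x) = x^2 for x = 1, 2, 3, 4
print(roulette_select(population, fitnesses, rng))
```

Individuals with zero fitness are never picked, and fitter individuals may be picked several times, which is how selection pressure enters the next generation.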
  23. 24. Homing to the Optimal Solution
  24. 25. Best-so-far Curve
  25. 26. Optimal Feature Subset <ul><li>Search for the Subsets of Discriminatory Features </li></ul><ul><ul><li>Combinatorial optimization problem </li></ul></ul><ul><ul><li>Two general approaches to identifying optimal subsets of features: </li></ul></ul><ul><ul><ul><li>Abstract measures of important properties of good feature sets </li></ul></ul></ul><ul><ul><ul><ul><li>Orthogonality (e.g., PCA), information content, low variance </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Less expensive process </li></ul></ul></ul></ul><ul><ul><ul><ul><li>May yield suboptimal performance if the abstract measures do not correlate well with actual performance </li></ul></ul></ul></ul><ul><ul><ul><li>Building a classifier from the feature subset and evaluating its performance on actual classification tasks </li></ul></ul></ul><ul><ul><ul><ul><li>Better classification performance </li></ul></ul></ul></ul><ul><ul><ul><ul><li>The cost of building and testing classifiers prohibits exhaustive evaluation of feature subsets </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Suboptimal in practice: large numbers of candidate features cannot be handled by any form of systematic search </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>There are $2^N$ possible candidate subsets of $N$ features </li></ul></ul></ul></ul></ul>
  26. 27. Inductive Learning <ul><li>Learning From Examples </li></ul><ul><ul><li>Decision Tree (DT) </li></ul></ul><ul><ul><li>Information Theory (IT) </li></ul></ul><ul><ul><li>Question: what are the BEST attributes (features) for building the decision tree? </li></ul></ul><ul><ul><li>Answer: the ‘BEST’ attribute is the one that is ‘MOST’ informative and for which ‘ambiguity/uncertainty’ is least </li></ul></ul><ul><ul><li>Solution: measure information content using the expected amount of information provided by the attribute </li></ul></ul>
  27. 28. Classification by Decision Tree Induction <ul><li>Decision tree </li></ul><ul><ul><li>A flow-chart-like tree structure </li></ul></ul><ul><ul><li>Internal node denotes a test on an attribute </li></ul></ul><ul><ul><li>Branch represents an outcome of the test </li></ul></ul><ul><ul><li>Leaf nodes represent class labels or class distribution </li></ul></ul><ul><li>Decision tree generation consists of two phases </li></ul><ul><ul><li>Tree construction </li></ul></ul><ul><ul><ul><li>At start, all the training examples are at the root </li></ul></ul></ul><ul><ul><ul><li>Partition examples recursively based on selected attributes </li></ul></ul></ul><ul><ul><li>Tree pruning </li></ul></ul><ul><ul><ul><li>Identify and remove branches that reflect noise or outliers </li></ul></ul></ul><ul><li>Use of decision tree: Classifying an unknown sample </li></ul><ul><ul><li>Test the attribute values of the sample against the decision tree </li></ul></ul><table><tr><th>Exs.</th><th>Size</th><th>Color</th><th>Surface</th><th>Class</th></tr><tr><td>1</td><td>Small</td><td>Yellow</td><td>Smooth</td><td>A</td></tr><tr><td>2</td><td>Medium</td><td>Red</td><td>Smooth</td><td>A</td></tr><tr><td>3</td><td>Medium</td><td>Red</td><td>Smooth</td><td>A</td></tr><tr><td>4</td><td>Big</td><td>Red</td><td>Rough</td><td>A</td></tr><tr><td>5</td><td>Medium</td><td>Yellow</td><td>Smooth</td><td>B</td></tr><tr><td>6</td><td>Medium</td><td>Yellow</td><td>Smooth</td><td>B</td></tr></table><ul><li>[Figure: decision tree] color = red → A; color = yellow → test size: small → A, medium → B </li></ul>
  28. 29. <ul><li>Entropy </li></ul><ul><ul><li>Define an entropy function $H$ such that $H(p_1, \dots, p_k) = -\sum_{i=1}^{k} p_i \log_2 p_i$ </li></ul></ul><ul><ul><li>where $p_i$ is the probability associated with the $i$-th class </li></ul></ul><ul><ul><li>For a feature, the entropy is calculated for each value. </li></ul></ul><ul><ul><li>The sum of the entropies, weighted by the probability of each value, is the entropy for that feature </li></ul></ul><ul><ul><li>Example: toss a fair coin: $H(1/2, 1/2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit </li></ul></ul><ul><ul><li>If the coin is not fair, e.g. $P_{heads} = 99\%$, then $H(0.99, 0.01) = -0.99\log_2 0.99 - 0.01\log_2 0.01 \approx 0.08$ bits </li></ul></ul><ul><ul><li>So, by tossing the coin you get very little (extra) information (that you didn’t expect) </li></ul></ul>
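As a quick numeric check of the coin examples, the entropy function can be sketched in Python. The function name is illustrative, base-2 logarithms are assumed so that the result is in bits, and zero-probability outcomes are skipped by the usual convention that 0 · log 0 = 0.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))     # fair coin
print(entropy([0.99, 0.01]))   # heavily biased coin
```

The biased coin's entropy is a small fraction of a bit, matching the slide's remark that the toss tells you very little you did not already expect.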
  29. 30. <ul><ul><li>In general, if you have $p$ positive examples and $n$ negative examples: $H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$ </li></ul></ul><ul><ul><ul><li>For $p = n$ → $H = 1$ </li></ul></ul></ul><ul><ul><ul><li>i.e., originally there is the most uncertainty about the eventual outcome (picking an example) and the most to gain by picking the example. </li></ul></ul></ul>
  30. 31. Decision Tree Induction <ul><li>Basic algorithm (a greedy algorithm) </li></ul><ul><ul><li>Tree is constructed in a top-down recursive divide-and-conquer manner </li></ul></ul><ul><ul><li>At start, all the training examples are at the root </li></ul></ul><ul><ul><li>Attributes are categorical ( if continuous-valued, they are discretized in advance ) </li></ul></ul><ul><ul><li>Examples are partitioned recursively based on selected attributes </li></ul></ul><ul><ul><li>Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain ) </li></ul></ul><ul><li>Conditions for stopping partitioning </li></ul><ul><ul><li>All samples for a given node belong to the same class </li></ul></ul><ul><ul><li>There are no remaining attributes for further partitioning </li></ul></ul><ul><ul><li>Majority voting is employed for classifying the leaf </li></ul></ul><ul><ul><li>There are no samples left </li></ul></ul>
  31. 32. Algorithm <ul><li>Select a random subset W (called the window) from the training set T </li></ul><ul><li>Build a DT for the current W </li></ul><ul><ul><li>Select the best feature which minimizes the entropy H (or max. gain) </li></ul></ul><ul><ul><li>Categorize training instances (examples) into subsets by this feature </li></ul></ul><ul><ul><li>Repeat this process recursively until each subset contains instances of one kind (class) or some statistical criterion is satisfied </li></ul></ul><ul><li>Scan the entire training set for exceptions to the DT </li></ul><ul><li>If exceptions are found insert some of them into W and repeat from step 2 </li></ul>
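  32. 33. <ul><li>Information Gain </li></ul><ul><ul><li>The information gain from the attribute test $A$ is defined as the difference between the original information requirement and the new requirement: $\mathrm{Gain}(A) = H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{Remainder}(A)$, where $\mathrm{Remainder}(A) = \sum_{i} \frac{p_i + n_i}{p + n} H\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$ and $p_i$, $n_i$ count the positive and negative examples having the $i$-th value of $A$ </li></ul></ul><ul><ul><ul><li>Note that Remainder($A$) is a weighted (by attribute values) entropy function </li></ul></ul></ul><ul><ul><li>Maximize Gain($A$) ⇔ Minimize Remainder($A$); then $A$ is the most informative attribute (‘question’) </li></ul></ul>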
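Information gain can be sketched on the six-example table used in the decision tree slides (size, color, surface, class). The rows below are read from the flattened table in the transcript, so their order is an assumption; the gains do not depend on row order. Function and variable names are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy in bits of the class distribution in a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(examples, attribute, target="cls"):
    """Gain(A) = H(all examples) - Remainder(A)."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[attribute]].append(ex[target])
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy([ex[target] for ex in examples]) - remainder


FIELDS = ("size", "color", "surface", "cls")
ROWS = [
    ("small", "yellow", "smooth", "A"),
    ("medium", "red", "smooth", "A"),
    ("medium", "red", "smooth", "A"),
    ("big", "red", "rough", "A"),
    ("medium", "yellow", "smooth", "B"),
    ("medium", "yellow", "smooth", "B"),
]
EXAMPLES = [dict(zip(FIELDS, row)) for row in ROWS]

gains = {a: information_gain(EXAMPLES, a) for a in ("size", "color", "surface")}
best = max(gains, key=gains.get)
print(gains, best)
```

On this data the highest gain belongs to color, which agrees with the example tree testing color at the root.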
  33. 34. The ID3 Algorithm and Quinlan’s C4.5 <ul><li>C4.5 </li></ul><ul><ul><li>Tutorial: </li></ul></ul><ul><ul><li>Matlab program: </li></ul></ul><ul><li>See5/C5.0 </li></ul><ul><ul><li>Tutorial: </li></ul></ul><ul><ul><li>Software for Win2000: </li></ul></ul>
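  34. 35. <ul><ul><li>Example: </li></ul></ul><table><tr><th>Exs.</th><th>Size</th><th>Color</th><th>Surface</th><th>Class</th></tr><tr><td>1</td><td>Small</td><td>Yellow</td><td>Smooth</td><td>A</td></tr><tr><td>2</td><td>Medium</td><td>Red</td><td>Smooth</td><td>A</td></tr><tr><td>3</td><td>Medium</td><td>Red</td><td>Smooth</td><td>A</td></tr><tr><td>4</td><td>Big</td><td>Red</td><td>Rough</td><td>A</td></tr><tr><td>5</td><td>Medium</td><td>Yellow</td><td>Smooth</td><td>B</td></tr><tr><td>6</td><td>Medium</td><td>Yellow</td><td>Smooth</td><td>B</td></tr></table><ul><li>[Figure: partial tree] color = red → A; color = yellow → ? </li></ul><ul><li>[Figure: final tree] color = red → A; color = yellow → test size: small → A, medium → B </li></ul>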
  35. 36. <ul><li>Noise and Overfitting </li></ul><ul><ul><li>Question: what about two or more examples with the same description but different classifications? </li></ul></ul><ul><ul><li>Answer: each leaf node reports either the MAJORITY classification or relative frequencies </li></ul></ul><ul><ul><li>Question: what about irrelevant attributes (noise and overfitting)? </li></ul></ul><ul><ul><li>Answer: tree pruning </li></ul></ul><ul><ul><li>Solution: an information gain close to zero is a good clue to irrelevance. Compare the actual numbers of (+) and (-) examples in each subset $i$, $p_i$ and $n_i$, with the expected numbers $\hat{p}_i$ and $\hat{n}_i$ assuming true irrelevance: $\hat{p}_i = p \cdot \frac{p_i + n_i}{p + n}$, $\hat{n}_i = n \cdot \frac{p_i + n_i}{p + n}$ </li></ul></ul><ul><ul><li>where $p$ and $n$ are the total numbers of positive and negative examples to start with. </li></ul></ul><ul><ul><li>Total deviation (for testing statistical significance): $D = \sum_{i} \left[ \frac{(p_i - \hat{p}_i)^2}{\hat{p}_i} + \frac{(n_i - \hat{n}_i)^2}{\hat{n}_i} \right]$ </li></ul></ul><ul><ul><li>Under the null hypothesis, $D$ follows a chi-squared distribution </li></ul></ul>
  36. 37. Extracting Classification Rules from Trees <ul><li>Represent the knowledge in the form of IF-THEN rules </li></ul><ul><li>One rule is created for each path from the root to a leaf </li></ul><ul><li>Each attribute-value pair along a path forms a conjunction </li></ul><ul><li>The leaf node holds the class prediction </li></ul><ul><li>Rules are easier for humans to understand </li></ul><ul><li>Example </li></ul><ul><ul><li>IF age = “<=30” AND student = “ no ” THEN buys_computer = “ no ” </li></ul></ul><ul><ul><li>IF age = “<=30” AND student = “ yes ” THEN buys_computer = “ yes ” </li></ul></ul><ul><ul><li>IF age = “31…40” THEN buys_computer = “ yes ” </li></ul></ul><ul><ul><li>IF age = “>40” AND credit_rating = “ excellent ” THEN buys_computer = “ yes ” </li></ul></ul><ul><ul><li>IF age = “>40” AND credit_rating = “ fair ” THEN buys_computer = “ no ” </li></ul></ul>
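Rule extraction amounts to enumerating root-to-leaf paths. The sketch below assumes a hypothetical tree encoding, a string for a leaf and an `(attribute, {value: subtree})` pair for an internal node, and rebuilds the five rules listed on this slide; it assumes the root is an internal node.

```python
def extract_rules(tree, target="buys_computer", conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if isinstance(tree, str):                         # leaf: holds the class prediction
        body = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {body} THEN {target} = "{tree}"']
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        # Each attribute-value pair on the path adds one conjunct.
        rules.extend(extract_rules(subtree, target, conditions + ((attribute, value),)))
    return rules


# Tree matching the slide's rules (the encoding itself is an assumption).
TREE = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31…40": "yes",
    ">40": ("credit_rating", {"excellent": "yes", "fair": "no"}),
})

rules = extract_rules(TREE)
for rule in rules:
    print(rule)
```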
  37. 38. Decision Tree <ul><li>Avoid Overfitting in Classification </li></ul><ul><ul><li>The generated tree may overfit the training data </li></ul></ul><ul><ul><ul><li>Too many branches, some of which may reflect anomalies due to noise or outliers </li></ul></ul></ul><ul><ul><ul><li>The result is poor accuracy on unseen samples </li></ul></ul></ul><ul><ul><li>Two approaches to avoid overfitting </li></ul></ul><ul><ul><ul><li>Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold </li></ul></ul></ul><ul><ul><ul><ul><li>Difficult to choose an appropriate threshold </li></ul></ul></ul></ul><ul><ul><ul><li>Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees </li></ul></ul></ul><ul><ul><ul><ul><li>Use a set of data different from the training data to decide which is the “best pruned tree” </li></ul></ul></ul></ul>
  38. 39. <ul><li>Approaches to Determine the Final Tree Size </li></ul><ul><ul><li>Separate training (2/3) and testing (1/3) sets </li></ul></ul><ul><ul><li>Use cross validation, e.g., 10-fold cross validation </li></ul></ul><ul><ul><li>Use all the data for training </li></ul></ul><ul><ul><ul><li>but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution </li></ul></ul></ul><ul><ul><li>Use minimum description length (MDL) principle: </li></ul></ul><ul><ul><ul><li>halting growth of the tree when the encoding is minimized </li></ul></ul></ul>
  39. 40. Decision Tree <ul><li>Enhancements to basic decision tree induction </li></ul><ul><ul><li>Allow for continuous-valued attributes </li></ul></ul><ul><ul><ul><li>Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals </li></ul></ul></ul><ul><ul><li>Handle missing attribute values </li></ul></ul><ul><ul><ul><li>Assign the most common value of the attribute </li></ul></ul></ul><ul><ul><ul><li>Assign probability to each of the possible values </li></ul></ul></ul><ul><ul><li>Attribute construction </li></ul></ul><ul><ul><ul><li>Create new attributes based on existing ones that are sparsely represented </li></ul></ul></ul><ul><ul><ul><li>This reduces fragmentation, repetition, and replication </li></ul></ul></ul>