Topic_6


  1. 1. CSCI 548/B480: Introduction to Bioinformatics Fall 2002 Dr. Jeffrey Huang, Assistant Professor Department of Computer and Information Science, IUPUI E-mail: huang@cs.iupui.edu Topic 5: Machine Intelligence - Learning and Evolution
  2. 2. Machine Intelligence <ul><li>Machine Learning </li></ul><ul><ul><li>The subfield of AI concerned with intelligent systems that learn. </li></ul></ul><ul><ul><li>The computational study of algorithms that improve performance based on experience. </li></ul></ul><ul><li>The attempt to build intelligent entities: </li></ul><ul><ul><li>We must understand intelligent entities first </li></ul></ul><ul><ul><li>Computational Brain </li></ul></ul><ul><ul><li>Mathematics: </li></ul></ul><ul><ul><ul><li>Philosophy staked out most of the ideas of AI, but making it a formal science requires mathematical formalization in </li></ul></ul></ul><ul><ul><ul><ul><li>Computation </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Logic </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Probability </li></ul></ul></ul></ul>
  3. 3. Behavior-Based AI vs. Knowledge Based <ul><li>Definitions of Machine Learning </li></ul><ul><ul><li>Reasoning </li></ul></ul><ul><ul><ul><li>The effort to make computers think and solve problems </li></ul></ul></ul><ul><ul><ul><li>The study of mental faculties through the use of computational models </li></ul></ul></ul><ul><ul><li>Behavior </li></ul></ul><ul><ul><ul><li>Make machines perform human actions that require intelligence </li></ul></ul></ul><ul><ul><ul><li>Seeks to explain intelligent behavior in terms of computational processes </li></ul></ul></ul><ul><li>Agents </li></ul>[Figure: an agent receives percepts from the environment through its sensors and acts on the environment through its effectors.]
  4. 4. Operational Agents <ul><li>Operational Views of Intelligence: </li></ul><ul><ul><li>The ability to perform intellectual tasks </li></ul></ul><ul><ul><ul><li>Prove theorems, play chess, solve puzzles </li></ul></ul></ul><ul><ul><ul><li>Focus on what goes on “between the ears” </li></ul></ul></ul><ul><ul><ul><li>Emphasize the ability to build and effectively use mental models </li></ul></ul></ul><ul><ul><li>The ability to perform intellectually challenging “real world” tasks </li></ul></ul><ul><ul><ul><li>Medical diagnosis, tax advising, financial investing </li></ul></ul></ul><ul><ul><ul><li>Introduce new issues such as: critical interactions with the world, model grounding, uncertainty </li></ul></ul></ul><ul><ul><li>The ability to survive, adapt, and function in a constantly changing world </li></ul></ul><ul><ul><ul><li>Autonomous agents </li></ul></ul></ul><ul><ul><ul><li>Vision, locomotion, and manipulation,… many I/O issues </li></ul></ul></ul><ul><ul><ul><li>Self-assessment, learning, curiosity, etc. </li></ul></ul></ul>
  5. 5. Building Intelligent Artifacts <ul><li>Symbolic Approaches: </li></ul><ul><ul><li>Construct goal-oriented symbol manipulation systems </li></ul></ul><ul><ul><li>Focus on high-end abstract thinking </li></ul></ul><ul><li>Non-symbolic approaches: </li></ul><ul><ul><li>Build performance-oriented systems </li></ul></ul><ul><ul><li>Focus on behavior </li></ul></ul><ul><li>Need both in tightly coupled form </li></ul><ul><ul><li>Difficulty in building such systems </li></ul></ul><ul><ul><li>Growing need to automate this process </li></ul></ul><ul><ul><li>Good approach: Evolutionary Algorithms </li></ul></ul>
  6. 6. <ul><li>Behavior-Based AI </li></ul><ul><ul><li>Behavior-Based AI vs. Knowledge-Based </li></ul></ul><ul><ul><li>&quot;Situated&quot; in environment </li></ul></ul><ul><ul><li>Multiple competencies ('routines') </li></ul></ul><ul><ul><li>Autonomy </li></ul></ul><ul><ul><li>Adaptation and Competition </li></ul></ul><ul><li>Artificial Life (A-Life) </li></ul><ul><ul><li>Agents: Reactive Behavior </li></ul></ul><ul><ul><li>Abstracting the logical principles of living organisms </li></ul></ul><ul><ul><li>Collective Behavior: Competition and Cooperation </li></ul></ul>
  7. 7. Classification vs. Prediction <ul><li>Classification: </li></ul><ul><ul><li>predicts categorical class labels </li></ul></ul><ul><ul><li>classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data </li></ul></ul><ul><li>Prediction: </li></ul><ul><ul><li>models continuous-valued functions, i.e., predicts unknown or missing values </li></ul></ul>
  8. 8. Classification: A Two-Step Process <ul><li>Model construction: describing a set of predetermined classes </li></ul><ul><ul><li>Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute </li></ul></ul><ul><ul><li>The set of tuples used for model construction: training set </li></ul></ul><ul><ul><li>The model is represented as classification rules, decision trees, or mathematical formulae </li></ul></ul><ul><li>Model usage: for classifying future or unknown objects </li></ul><ul><ul><li>Estimate accuracy of the model </li></ul></ul><ul><ul><ul><li>The known label of each test sample is compared with the classified result from the model </li></ul></ul></ul><ul><ul><ul><li>Accuracy rate is the percentage of test set samples that are correctly classified by the model </li></ul></ul></ul><ul><ul><ul><li>The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic because over-fitting goes undetected </li></ul></ul></ul>
  9. 9. Classification Process <ul><li>Model construction: the training data are fed to a classification algorithm, which produces the classifier (model), e.g. IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ </li></ul><ul><li>Use the model in prediction: the classifier (model) is checked against testing data and then applied to unseen data, e.g. (Jeff, Professor, 2): tenured? </li></ul>
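As a small illustration of the model-usage step, the rule on this slide can be written as a function, scored on a test set, and then applied to the unseen tuple (Jeff, Professor, 2). This is only a sketch: the function names and the labelled test tuples are assumptions made up for the example, not data from the slides.

```python
def tenure_rule(rank, years):
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(model, test_set):
    """Fraction of test samples whose known label matches the model's output."""
    correct = sum(1 for rank, years, label in test_set if model(rank, years) == label)
    return correct / len(test_set)


# Hypothetical labelled test data (not from the slides): (rank, years, tenured)
TEST_SET = [
    ("Assistant Prof", 2, "no"),
    ("Associate Prof", 7, "yes"),
    ("Professor", 5, "yes"),
    ("Assistant Prof", 7, "no"),  # the rule misclassifies this sample
]

print("accuracy on test set:", accuracy(tenure_rule, TEST_SET))
print("Jeff tenured?", tenure_rule("Professor", 2))
```

The test tuples are kept separate from whatever data produced the rule, which is the independence requirement stated on the previous slide.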
  10. 10. Supervised vs. Unsupervised Learning <ul><li>Supervised learning (classification) </li></ul><ul><ul><li>Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations </li></ul></ul><ul><ul><li>New data is classified based on the training set </li></ul></ul><ul><li>Unsupervised learning (clustering) </li></ul><ul><ul><li>The class labels of the training data are unknown </li></ul></ul><ul><ul><li>Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data </li></ul></ul>
  11. 11. Classification and Prediction <ul><li>Data Preparation </li></ul><ul><ul><li>Data cleaning </li></ul></ul><ul><ul><ul><li>Preprocess data in order to reduce noise and handle missing values </li></ul></ul></ul><ul><ul><li>Relevance analysis (feature selection) </li></ul></ul><ul><ul><ul><li>Remove the irrelevant or redundant attributes </li></ul></ul></ul><ul><ul><li>Data transformation </li></ul></ul><ul><ul><ul><li>Generalize and/or normalize data </li></ul></ul></ul><ul><li>Evaluating Classification Methods </li></ul><ul><ul><li>Predictive accuracy </li></ul></ul><ul><ul><li>Speed and scalability </li></ul></ul><ul><ul><ul><li>time to construct the model </li></ul></ul></ul><ul><ul><ul><li>time to use the model </li></ul></ul></ul><ul><ul><li>Robustness : handling noise and missing values </li></ul></ul><ul><ul><li>Scalability : efficiency in disk-resident databases </li></ul></ul><ul><ul><li>Interpretability : understanding and insight provided by the model </li></ul></ul><ul><ul><li>Goodness of rules </li></ul></ul><ul><ul><ul><li>decision tree size </li></ul></ul></ul><ul><ul><ul><li>compactness of classification rules </li></ul></ul></ul>
  12. 12. From Learning to Evolutionary <ul><li>Optimization </li></ul><ul><ul><li>Accomplishing abstract task = Solving problem </li></ul></ul><ul><ul><li>= searching through a space of potential solutions </li></ul></ul><ul><ul><li>finding the “best solution” </li></ul></ul><ul><ul><ul><li>⇒ an optimization process </li></ul></ul></ul><ul><ul><li>Classical Exhaustive Methods?? </li></ul></ul><ul><ul><li>Large Space?? Special machine learning technique </li></ul></ul><ul><li>Evolutionary Algorithms </li></ul><ul><ul><li>Stochastic Algorithms </li></ul></ul><ul><ul><li>Search methods that model some natural phenomena: </li></ul></ul><ul><ul><ul><li>Genetic Inheritance </li></ul></ul></ul><ul><ul><ul><li>Darwinian strife for survival </li></ul></ul></ul>
  13. 13. <ul><li>“… the metaphor underlying genetic algorithms is that of natural evolution. In evolution, the problem each species faces is one of searching for beneficial adaptations to a complicated and changing environment. The ‘knowledge’ that each species has gained is embodied in the makeup of chromosomes of its members” </li></ul><ul><ul><li>- L. Davis and M. Steenstrup, “Genetic Algorithms and Simulated Annealing”, pp. 1-11, Morgan Kaufmann, 1987 </li></ul></ul>
  14. 14. The Essential Components <ul><ul><li>A genetic representation for potential solutions to the problem </li></ul></ul><ul><ul><li>A way to create an initial population of potential solutions </li></ul></ul><ul><ul><li>An evaluation function that plays the role of the environment, rating solutions in terms of their “fitness” </li></ul></ul><ul><ul><ul><li>i.e. the use of fitness to determine survival and reproductive rates </li></ul></ul></ul><ul><ul><li>Genetic operators that alter the composition of children </li></ul></ul>
  15. 15. Evolutionary Algorithm Search Procedure <ul><li>Randomly generate an initial population M(0) </li></ul><ul><li>Compute and save the fitness u(m) for each individual m in the current population M(t) </li></ul><ul><li>Define selection probabilities p(m) for each individual m in M(t) so that p(m) is proportional to u(m) </li></ul><ul><li>Generate M(t+1) by probabilistically selecting individuals to produce offspring via genetic operations (crossover and mutation) </li></ul>
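The search procedure above can be sketched as a generic loop. This is a minimal illustration rather than a reference implementation: the function names, the fixed number of generations, the fallback to uniform selection when all fitness values are zero, and the bit-counting toy problem are all assumptions introduced here. Fitness values are assumed to be non-negative.

```python
import random


def evolve(init, fitness, crossover, mutate, pop_size=24, generations=50, seed=0):
    """Generic evolutionary search; returns the fittest individual seen."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]        # M(0)
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(m) for m in population]            # u(m) for m in M(t)
        # p(m) proportional to u(m); uniform if every fitness is zero
        weights = scores if sum(scores) > 0 else [1.0] * pop_size
        offspring = []
        while len(offspring) < pop_size:                      # build M(t+1)
            a, b = rng.choices(population, weights=weights, k=2)
            offspring.append(mutate(crossover(a, b, rng), rng))
        population = offspring
        best = max(population + [best], key=fitness)
    return best


# Toy problem (an assumption for this sketch): maximise the number of 1-bits.
def random_bits(rng, length=10):
    return [rng.randint(0, 1) for _ in range(length)]


def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def bit_flip(bits, rng, p_m=0.01):
    return [1 - g if rng.random() < p_m else g for g in bits]


best = evolve(random_bits, sum, one_point_crossover, bit_flip)
print(best, sum(best))
```

Keeping the representation, fitness, and operators as parameters mirrors the list of essential components: only those four pieces change from problem to problem.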
  16. 16. Historical Background <ul><li>Three paradigms emerged in the 1960s: </li></ul><ul><ul><li>Genetic Algorithms </li></ul></ul><ul><ul><ul><li>Introduced by Holland (University of Michigan) → De Jong (GMU) </li></ul></ul></ul><ul><ul><ul><li>Envisioned for a broad range of “adaptive systems” </li></ul></ul></ul><ul><ul><li>Evolution Strategies </li></ul></ul><ul><ul><ul><li>Introduced by Rechenberg </li></ul></ul></ul><ul><ul><ul><li>Focused on real-valued parameter optimization </li></ul></ul></ul><ul><ul><li>Evolutionary Programming </li></ul></ul><ul><ul><ul><li>Introduced by Fogel (Koza later developed the related genetic programming) </li></ul></ul></ul><ul><ul><ul><li>Applied to AI and machine learning problems </li></ul></ul></ul><ul><li>Today: </li></ul><ul><ul><li>Wide variety of evolutionary algorithms </li></ul></ul><ul><ul><li>Applied to many areas of science and engineering </li></ul></ul>
  17. 17. Examples of Evolutionary AI <ul><li>Parameter Tuning </li></ul><ul><ul><li>Pervasiveness of parameterized models </li></ul></ul><ul><ul><li>Complex behavioral changes due to non-linear interactions </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>Weights of an artificial neural network </li></ul></ul></ul><ul><ul><ul><li>Parameters of a heuristic evaluation function </li></ul></ul></ul><ul><ul><ul><li>Parameters of a rule induction system </li></ul></ul></ul><ul><ul><ul><li>Parameters of membership functions </li></ul></ul></ul><ul><ul><li>Goal: evolve over time a useful set of discrete/continuous parameters </li></ul></ul>
  18. 18. <ul><li>Evolving Structure </li></ul><ul><ul><li>Effect behavior change via more complex structures </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>Selecting/constructing the topology of ANNs </li></ul></ul></ul><ul><ul><ul><li>Selecting/constructing the feature sets </li></ul></ul></ul><ul><ul><ul><li>Selecting/constructing plans/scenarios </li></ul></ul></ul><ul><ul><ul><li>Selecting/constructing membership functions </li></ul></ul></ul><ul><ul><li>Goal: evolve useful structure over time </li></ul></ul><ul><li>Evolving Programs </li></ul><ul><ul><li>Goal: acquire new behaviors and adapt existing ones </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>Acquire/adapt behavioral rule sets </li></ul></ul></ul><ul><ul><ul><li>Acquire/adapt arm/joint control programs </li></ul></ul></ul><ul><ul><ul><li>Acquire/adapt task-oriented programming code </li></ul></ul></ul>
  19. 19. How Does a Genetic Algorithm Work? <ul><li>A simple example of function optimization </li></ul><ul><ul><li>Find max $f(x) = x^2$, for $x \in [0, 4]$ </li></ul></ul><ul><ul><li>Representation: </li></ul></ul><ul><ul><ul><li>Genotype (chromosome): internally, points in the search space are represented as (binary) strings over some alphabet </li></ul></ul></ul><ul><ul><ul><li>Phenotype: the expressed traits of an individual </li></ul></ul></ul><ul><ul><ul><li>To distinguish about $10^4$ values of x in [0, 4], 14 bits are needed, which gives a step of $4/(2^{14}-1) \approx 2.4 \times 10^{-4}$ </li></ul></ul></ul><ul><ul><ul><ul><li>$8{,}000 \approx 2^{13} < 10{,}000 < 2^{14} \approx 16{,}000$ </li></ul></ul></ul></ul><ul><ul><ul><li>Simple fixed-length binary encoding </li></ul></ul></ul><ul><ul><ul><ul><li>Assign 0.0 to the string 00 0000 0000 0000 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Assign $0.0 + \text{bin2dec}(\text{binary string}) \times 4/(2^{14}-1)$ to the string 00 0000 0000 0001, and so on </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Phenotype 4.0 = genotype 11 1111 1111 1111 </li></ul></ul></ul></ul>
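The genotype-to-phenotype mapping on this slide can be sketched in a few lines. The constant and function names are assumptions chosen for this example; `int(bits, 2)` plays the role of the slide's bin2dec.

```python
import math

LOW, HIGH, N_BITS = 0.0, 4.0, 14  # search interval and chromosome length from the slide


def bits_needed(n_values):
    """Smallest number of bits that can distinguish n_values points."""
    return math.ceil(math.log2(n_values))


def decode(chromosome):
    """Map a binary string (spaces allowed for readability) to x in [LOW, HIGH]."""
    bits = chromosome.replace(" ", "")
    return LOW + int(bits, 2) * (HIGH - LOW) / (2 ** N_BITS - 1)


def fitness(chromosome):
    """eval(v) = f(x) = x^2 for the decoded phenotype x."""
    return decode(chromosome) ** 2


print(bits_needed(10_000))          # number of bits for about 10^4 values
print(decode("00 0000 0000 0000"))  # lower end of the interval
print(decode("00 0000 0000 0001"))  # one step above the lower end
print(decode("11 1111 1111 1111"))  # upper end of the interval
```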
  20. 20. <ul><ul><li>Initial population: </li></ul></ul><ul><ul><ul><li>Create a population (pop_size) of chromosomes, where each chromosome is a binary vector of 14 bits </li></ul></ul></ul><ul><ul><ul><li>All 14 bits for each chromosome are initialized randomly </li></ul></ul></ul><ul><ul><li>Evaluation function </li></ul></ul><ul><ul><ul><li>The evaluation function eval for binary vectors v is equal to the function f: </li></ul></ul></ul><ul><ul><ul><li>$\text{eval}(v) = f(x)$ </li></ul></ul></ul><ul><ul><ul><li>e.g., $\text{eval}(v_1) = f(x_1) = \text{fitness}_1$ </li></ul></ul></ul>[Figure: genotype-to-phenotype mapping, 00000000000000 → 0.0, 00000000000001 → $4/(2^{14}-1)$, …, 11111111111111 → 4.0, for a population of chromosomes $v_1, \dots, v_{24}$.]
  21. 21. <ul><ul><li>Parameters </li></ul></ul><ul><ul><ul><li>pop_size = 24, </li></ul></ul></ul><ul><ul><ul><li>Probability of crossover, $p_c = 0.6$, </li></ul></ul></ul><ul><ul><ul><li>Probability of mutation, $p_m = 0.01$ </li></ul></ul></ul><ul><ul><li>Recombination: using genetic operations </li></ul></ul><ul><ul><ul><li>Crossover ($p_c$) </li></ul></ul></ul><ul><ul><ul><li>$v_1$ = 0111 1100010011 => $v_1'$ = 0111 0101011100 </li></ul></ul></ul><ul><ul><ul><li>$v_2$ = 0001 0101011100 => $v_2'$ = 0001 1100010011 </li></ul></ul></ul><ul><ul><ul><li>Mutation ($p_m$) </li></ul></ul></ul><ul><ul><ul><li>$v_2'$ = 000111 0 0010011 => $v_2''$ = 000111 1 0010011 </li></ul></ul></ul>
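The crossover and mutation examples on this slide can be reproduced with two small string operations. In this sketch the crossover point and the mutated position are fixed by hand to match the slide; in an actual run, a pair would be crossed with probability 0.6 at a random point and each bit flipped with probability 0.01. The function names are assumptions for illustration.

```python
def crossover(a, b, point):
    """One-point crossover: swap the tails of a and b after `point` bits."""
    return a[:point] + b[point:], b[:point] + a[point:]


def flip_bit(bits, index):
    """Mutation: invert the bit at the given 0-based index."""
    flipped = "1" if bits[index] == "0" else "0"
    return bits[:index] + flipped + bits[index + 1:]


v1 = "01111100010011"
v2 = "00010101011100"

# Cross after the fourth bit, as in the slide.
v1_child, v2_child = crossover(v1, v2, 4)
print(v1_child)  # 0111 0101011100 without the space
print(v2_child)  # 0001 1100010011 without the space

# Flip the seventh bit (index 6) of the second child, as in the slide.
v2_mutated = flip_bit(v2_child, 6)
print(v2_mutated)
```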
  22. 22. <ul><ul><li>Selection of M(t+1) from M(t): using a roulette wheel </li></ul></ul><ul><ul><ul><li>Total fitness of the population: $F = \sum_{i=1}^{\text{pop\_size}} \text{eval}(v_i)$ </li></ul></ul></ul><ul><ul><ul><li>Probability of selection $\text{prob}_i$ for each chromosome $v_i$: $\text{prob}_i = \text{eval}(v_i)/F$ </li></ul></ul></ul><ul><ul><ul><li>Cumulative probability $q_i = \sum_{j=1}^{i} \text{prob}_j$ </li></ul></ul></ul><ul><ul><ul><li>Generate random numbers $r_j$ from [0,1], where j = 1 … pop_size </li></ul></ul></ul><ul><ul><ul><li>Select chromosome $v_i$ such that $q_{i-1} < r_j \le q_i$ </li></ul></ul></ul>
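Roulette-wheel selection as listed on this slide can be sketched with cumulative probabilities and a binary search. The function name and the use of `bisect` are choices made for this example, and the sketch assumes non-negative fitness values with a positive total.

```python
import bisect
import itertools
import random


def roulette_select(population, fitnesses, rng):
    """Draw len(population) individuals with probability proportional to fitness."""
    total = sum(fitnesses)                              # total fitness F
    probs = [f / total for f in fitnesses]              # prob_i = eval(v_i) / F
    cumulative = list(itertools.accumulate(probs))      # q_i
    cumulative[-1] = 1.0                                # guard against round-off
    selected = []
    for _ in range(len(population)):
        r = rng.random()                                # r_j in [0, 1)
        i = bisect.bisect_left(cumulative, r)           # first i with r <= q_i
        selected.append(population[i])
    return selected


rng = random.Random(0)
chromosomes = ["v1", "v2", "v3", "v4"]
fitness_values = [1.0, 4.0, 9.0, 16.0]  # hypothetical fitness values
print(roulette_select(chromosomes, fitness_values, rng))
```

Fitter chromosomes occupy a wider slice of the wheel and so tend to be drawn more than once, while weak ones may not be drawn at all.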
  23. 24. Homing to the Optimal Solution
  24. 25. Best-so-far Curve
  25. 26. Optimal Feature Subset <ul><li>Search for the Subsets of Discriminatory Features </li></ul><ul><ul><li>Combinatorial optimization problem </li></ul></ul><ul><ul><li>Two general approaches to identifying optimal subsets of features: </li></ul></ul><ul><ul><ul><li>Abstract measurement for important properties of good feature sets </li></ul></ul></ul><ul><ul><ul><ul><li>Orthogonality (e.g., PCA), information content, low variance </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Less expensive process </li></ul></ul></ul></ul><ul><ul><ul><ul><li>May yield suboptimal performance if the abstract measures do not correlate well with actual performance </li></ul></ul></ul></ul><ul><ul><ul><li>Building a classifier from the feature subset and evaluating its performance on actual classification tasks. </li></ul></ul></ul><ul><ul><ul><ul><li>Better classification performance </li></ul></ul></ul></ul><ul><ul><ul><ul><li>The cost of building and testing classifiers prohibits any kind of systematic evaluation of feature subsets </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Suboptimal in practice: large numbers of candidate features cannot be handled by any form of systematic search </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>$2^N$ possible candidate subsets of N features. </li></ul></ul></ul></ul></ul>
  26. 27. Inductive Learning <ul><li>Learning From Examples </li></ul><ul><ul><li>Decision Tree (DT): </li></ul></ul><ul><ul><li>Information Theory (IT) </li></ul></ul><ul><ul><li>Question: what are the BEST attributes (features) for building the decision tree? </li></ul></ul><ul><ul><li>Answer: the ‘BEST’ attribute is the one that is ‘MOST’ informative and for which ‘ambiguity/uncertainty’ is least </li></ul></ul><ul><ul><li>Solution: Measure (information) content using the expected amount of information provided by the attribute </li></ul></ul>
  27. 28. Classification by Decision Tree Induction <ul><li>Decision tree </li></ul><ul><ul><li>A flow-chart-like tree structure </li></ul></ul><ul><ul><li>Internal node denotes a test on an attribute </li></ul></ul><ul><ul><li>Branch represents an outcome of the test </li></ul></ul><ul><ul><li>Leaf nodes represent class labels or class distribution </li></ul></ul><ul><li>Decision tree generation consists of two phases </li></ul><ul><ul><li>Tree construction </li></ul></ul><ul><ul><ul><li>At start, all the training examples are at the root </li></ul></ul></ul><ul><ul><ul><li>Partition examples recursively based on selected attributes </li></ul></ul></ul><ul><ul><li>Tree pruning </li></ul></ul><ul><ul><ul><li>Identify and remove branches that reflect noise or outliers </li></ul></ul></ul><ul><li>Use of decision tree: Classifying an unknown sample </li></ul><ul><ul><li>Test the attribute values of the sample against the decision tree </li></ul></ul><table><tr><th>Exs.</th><th>Surface</th><th>Color</th><th>Size</th><th>Class</th></tr><tr><td>1</td><td>Smooth</td><td>Yellow</td><td>Small</td><td>A</td></tr><tr><td>2</td><td>Smooth</td><td>Red</td><td>Medium</td><td>A</td></tr><tr><td>3</td><td>Smooth</td><td>Red</td><td>Medium</td><td>A</td></tr><tr><td>4</td><td>Rough</td><td>Red</td><td>Big</td><td>A</td></tr><tr><td>5</td><td>Smooth</td><td>Yellow</td><td>Medium</td><td>B</td></tr><tr><td>6</td><td>Smooth</td><td>Yellow</td><td>Medium</td><td>B</td></tr></table>[Figure: decision tree for the table. Test color: red → A; yellow → test size: small → A, medium → B.]
  28. 29. <ul><li>Entropy </li></ul><ul><ul><li>Define an entropy function H such that $H = -\sum_i p_i \log_2 p_i$ </li></ul></ul><ul><ul><li>where $p_i$ is the probability associated with the i-th class </li></ul></ul><ul><ul><li>For a feature, the entropy is calculated for each value. </li></ul></ul><ul><ul><li>The sum of the entropies, weighted by the probability of each value, is the entropy for that feature </li></ul></ul><ul><ul><li>Example: toss a fair coin: $H(\tfrac12, \tfrac12) = -\tfrac12 \log_2 \tfrac12 - \tfrac12 \log_2 \tfrac12 = 1$ bit </li></ul></ul><ul><ul><li>If the coin is not fair, e.g. $P_{\text{heads}} = 99\%$, then $H(0.99, 0.01) = -0.99 \log_2 0.99 - 0.01 \log_2 0.01 \approx 0.08$ bits </li></ul></ul><ul><ul><li>So, by tossing the coin you get very little (extra) information (that you didn’t expect) </li></ul></ul>
  29. 30. <ul><ul><li>In general, if you have p positive examples and n negative examples: $H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$ </li></ul></ul><ul><ul><ul><li>For p = n ⇒ H = 1 </li></ul></ul></ul><ul><ul><ul><li>i.e., initially there is the most uncertainty about the eventual outcome (picking an example) and the most to gain by picking the example. </li></ul></ul></ul>
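The coin examples and the two-class case can be checked numerically. This is a minimal sketch using base-2 logarithms (entropy in bits); the function names are assumptions for illustration, and terms with zero probability are skipped, following the usual convention that 0 · log 0 = 0.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def two_class_entropy(p, n):
    """Entropy of a set with p positive and n negative examples."""
    total = p + n
    return entropy([p / total, n / total])


print(entropy([0.5, 0.5]))              # fair coin
print(round(entropy([0.99, 0.01]), 4))  # heavily biased coin
print(two_class_entropy(3, 3))          # p = n
print(two_class_entropy(4, 2))          # the class mix of the six-example table
```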
  30. 31. Decision Tree Induction <ul><li>Basic algorithm (a greedy algorithm) </li></ul><ul><ul><li>Tree is constructed in a top-down recursive divide-and-conquer manner </li></ul></ul><ul><ul><li>At start, all the training examples are at the root </li></ul></ul><ul><ul><li>Attributes are categorical (if continuous-valued, they are discretized in advance) </li></ul></ul><ul><ul><li>Examples are partitioned recursively based on selected attributes </li></ul></ul><ul><ul><li>Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) </li></ul></ul><ul><li>Conditions for stopping partitioning </li></ul><ul><ul><li>All samples for a given node belong to the same class </li></ul></ul><ul><ul><li>There are no remaining attributes for further partitioning (majority voting is then employed for classifying the leaf) </li></ul></ul><ul><ul><li>There are no samples left </li></ul></ul>
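The greedy, top-down procedure and its stopping conditions can be sketched as a short recursive function. This is an illustrative ID3-style sketch, not the exact algorithm used in the course: the nested-dictionary tree representation and all function names are assumptions, and the data are the six examples from the table on the "Classification by Decision Tree Induction" slide. Because the sketch only creates branches for attribute values that actually occur, the "no samples left" case never arises here.

```python
import math
from collections import Counter

# (features, class) pairs transcribed from the six-example table in these slides.
ROWS = [
    ({"size": "small", "color": "yellow", "surface": "smooth"}, "A"),
    ({"size": "medium", "color": "red", "surface": "smooth"}, "A"),
    ({"size": "medium", "color": "red", "surface": "smooth"}, "A"),
    ({"size": "big", "color": "red", "surface": "rough"}, "A"),
    ({"size": "medium", "color": "yellow", "surface": "smooth"}, "B"),
    ({"size": "medium", "color": "yellow", "surface": "smooth"}, "B"),
]


def entropy(rows):
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def split(rows, attribute):
    """Partition rows by the value they take for the given attribute."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attribute], []).append((features, label))
    return groups


def build_tree(rows, attributes):
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: all samples in one class, or no attributes left (majority vote).
    if len(set(labels)) == 1 or not attributes:
        return majority

    def remainder(attribute):
        # Weighted entropy after the split; minimising it maximises the gain.
        return sum(len(g) / len(rows) * entropy(g) for g in split(rows, attribute).values())

    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]
    return {best: {value: build_tree(group, rest) for value, group in split(rows, best).items()}}


def classify(tree, features):
    """Follow the branches matching the sample until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[features[attribute]]
    return tree


tree = build_tree(ROWS, ["size", "color", "surface"])
print(tree)
print(classify(tree, {"size": "medium", "color": "yellow", "surface": "rough"}))
```

Note that `classify` raises a `KeyError` for an attribute value that was never seen during construction; a fuller implementation would fall back to the majority class at that node.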
  31. 32. Algorithm <ul><li>Select a random subset W (called the window) from the training set T </li></ul><ul><li>Build a DT for the current W </li></ul><ul><ul><li>Select the best feature which minimizes the entropy H (or max. gain) </li></ul></ul><ul><ul><li>Categorize training instances (examples) into subsets by this feature </li></ul></ul><ul><ul><li>Repeat this process recursively until each subset contains instances of one kind (class) or some statistical criterion is satisfied </li></ul></ul><ul><li>Scan the entire training set for exceptions to the DT </li></ul><ul><li>If exceptions are found insert some of them into W and repeat from step 2 </li></ul>
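  32. 33. <ul><li>Information Gain </li></ul><ul><ul><li>The information gain from the test on attribute $\alpha$ is defined as the difference between the original information requirement and the new requirement: $\text{Gain}(\alpha) = H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \text{Remainder}(\alpha)$ </li></ul></ul><ul><ul><ul><li>Note that $\text{Remainder}(\alpha) = \sum_i \frac{p_i+n_i}{p+n} H\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$ is an entropy function weighted by the attribute values, where $p_i$ and $n_i$ are the numbers of positive and negative examples taking the i-th value of $\alpha$ </li></ul></ul></ul><ul><ul><li>Maximize Gain($\alpha$) ⇔ Minimize Remainder($\alpha$); then $\alpha$ is the most informative attribute (‘question’) </li></ul></ul>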
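To make the definition concrete, the gain of each attribute can be computed for the six-example table shown on the "Classification by Decision Tree Induction" slide. This is a sketch under the assumption that the table is transcribed correctly; the dictionary layout and function names are chosen for illustration only.

```python
import math
from collections import Counter

# The six training examples from the table in these slides.
EXAMPLES = [
    {"size": "small", "color": "yellow", "surface": "smooth", "class": "A"},
    {"size": "medium", "color": "red", "surface": "smooth", "class": "A"},
    {"size": "medium", "color": "red", "surface": "smooth", "class": "A"},
    {"size": "big", "color": "red", "surface": "rough", "class": "A"},
    {"size": "medium", "color": "yellow", "surface": "smooth", "class": "B"},
    {"size": "medium", "color": "yellow", "surface": "smooth", "class": "B"},
]


def entropy(rows):
    """Entropy in bits of the class distribution in rows."""
    counts = Counter(r["class"] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def remainder(rows, attribute):
    """Entropy left after the test, weighted by how often each value occurs."""
    value_counts = Counter(r[attribute] for r in rows)
    return sum(
        n / len(rows) * entropy([r for r in rows if r[attribute] == value])
        for value, n in value_counts.items()
    )


def gain(rows, attribute):
    return entropy(rows) - remainder(rows, attribute)


for name in ("size", "color", "surface"):
    print(name, round(gain(EXAMPLES, name), 3))
```

Color has the largest gain on this table, which agrees with the tree in the slides that tests color at the root.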
  33. 34. The ID3 Algorithm and Quinlan’s C4.5 <ul><li>C4.5 </li></ul><ul><ul><li>Tutorial: http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/ </li></ul></ul><ul><ul><li>Matlab program: http://www.cs.wisc.edu/~olvi/uwmp/msmt.html </li></ul></ul><ul><li>See5/C5.0 </li></ul><ul><ul><li>Tutorial: http://borba.ncc.up.pt/niaad/Software/c50/c50manual.html </li></ul></ul><ul><ul><li>Software for Win2000: http://www.rulequest.com/download.html </li></ul></ul>
  34. 35. <ul><ul><li>Example: </li></ul></ul>[Figure: the six-example training table (Exs., Surface, Color, Size, Class) from the “Classification by Decision Tree Induction” slide, with the tree grown in two steps. Step 1: test color: red → A, yellow → ?. Step 2: the yellow branch tests size: small → A, medium → B.]
  35. 36. <ul><li>Noise and Overfitting </li></ul><ul><ul><li>Question: what about two or more examples with the same description but different classifications? </li></ul></ul><ul><ul><li>Answer: Each leaf node reports either the MAJORITY classification or relative frequencies </li></ul></ul><ul><ul><li>Question: what about irrelevant attributes (noise and overfitting)? </li></ul></ul><ul><ul><li>Answer: Tree pruning </li></ul></ul><ul><ul><li>Solution: An information gain close to zero is a good clue to irrelevance. Compare the actual numbers of (+) and (−) examples in each subset i, $p_i$ and $n_i$, with the expected numbers $\hat{p}_i$ and $\hat{n}_i$ assuming true irrelevance: $\hat{p}_i = p \cdot \frac{p_i+n_i}{p+n}$, $\hat{n}_i = n \cdot \frac{p_i+n_i}{p+n}$ </li></ul></ul><ul><ul><li>where p and n are the total numbers of positive and negative examples to start with. </li></ul></ul><ul><ul><li>Total deviation (used to assess statistical significance): $D = \sum_i \left[ \frac{(p_i-\hat{p}_i)^2}{\hat{p}_i} + \frac{(n_i-\hat{n}_i)^2}{\hat{n}_i} \right]$ </li></ul></ul><ul><ul><li>Under the null hypothesis (the attribute is irrelevant), D follows a chi-squared distribution with v − 1 degrees of freedom, where v is the number of values of the attribute </li></ul></ul>
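The deviation statistic can be sketched directly from the counts in each subset. The formula used here is the standard chi-squared pruning test, with expected counts that assume the attribute is irrelevant; the function name and the choice to treat class A as "positive" in the six-example table are assumptions for this illustration. The sketch requires at least one positive and one negative example overall.

```python
def total_deviation(subsets):
    """Chi-squared deviation D for a split.

    subsets: list of (p_i, n_i) pairs, the positive and negative counts
    observed for each value of the attribute.
    """
    p = sum(p_i for p_i, _ in subsets)
    n = sum(n_i for _, n_i in subsets)
    deviation = 0.0
    for p_i, n_i in subsets:
        share = (p_i + n_i) / (p + n)        # fraction of examples in this subset
        p_hat, n_hat = p * share, n * share  # expected counts under irrelevance
        deviation += (p_i - p_hat) ** 2 / p_hat + (n_i - n_hat) ** 2 / n_hat
    return deviation


# Split on color in the six-example table, taking class A as positive:
# red -> (3 A, 0 B), yellow -> (1 A, 2 B).
print(total_deviation([(3, 0), (1, 2)]))

# A split that leaves the class proportions unchanged has zero deviation.
print(total_deviation([(2, 1), (2, 1)]))
```

With two attribute values there is one degree of freedom, for which the 5% critical value of the chi-squared distribution is about 3.84, so a deviation computed from only six examples is weak evidence either way.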
  36. 37. Extracting Classification Rules from Trees <ul><li>Represent the knowledge in the form of IF-THEN rules </li></ul><ul><li>One rule is created for each path from the root to a leaf </li></ul><ul><li>Each attribute-value pair along a path forms a conjunction </li></ul><ul><li>The leaf node holds the class prediction </li></ul><ul><li>Rules are easier for humans to understand </li></ul><ul><li>Example </li></ul><ul><ul><li>IF age = “<=30” AND student = “no” THEN buys_computer = “no” </li></ul></ul><ul><ul><li>IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” </li></ul></ul><ul><ul><li>IF age = “31…40” THEN buys_computer = “yes” </li></ul></ul><ul><ul><li>IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes” </li></ul></ul><ul><ul><li>IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no” </li></ul></ul>
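The one-rule-per-path idea can be sketched as a recursive walk over the tree. The nested-dictionary tree below is reconstructed from the five example rules on this slide, and both that representation and the function name are assumptions made for this illustration.

```python
# Tree reconstructed from the example rules: {attribute: {value: subtree or class}}
TREE = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}},
    }
}


def extract_rules(tree, target, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):  # leaf: holds the class prediction
        body = " AND ".join(f'{a} = "{v}"' for a, v in conditions) or "TRUE"
        return [f'IF {body} THEN {target} = "{tree}"']
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        # Each attribute-value pair on the path adds one conjunct.
        rules.extend(extract_rules(subtree, target, conditions + ((attribute, value),)))
    return rules


rules = extract_rules(TREE, "buys_computer")
for rule in rules:
    print(rule)
```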
  37. 38. Decision Tree <ul><li>Avoid Overfitting in Classification </li></ul><ul><ul><li>The generated tree may overfit the training data </li></ul></ul><ul><ul><ul><li>Too many branches, some of which may reflect anomalies due to noise or outliers </li></ul></ul></ul><ul><ul><ul><li>The result is poor accuracy for unseen samples </li></ul></ul></ul><ul><ul><li>Two approaches to avoid overfitting </li></ul></ul><ul><ul><ul><li>Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold </li></ul></ul></ul><ul><ul><ul><ul><li>Difficult to choose an appropriate threshold </li></ul></ul></ul></ul><ul><ul><ul><li>Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees </li></ul></ul></ul><ul><ul><ul><ul><li>Use a set of data different from the training data to decide which is the “best pruned tree” </li></ul></ul></ul></ul>
  38. 39. <ul><li>Approaches to Determine the Final Tree Size </li></ul><ul><ul><li>Separate training (2/3) and testing (1/3) sets </li></ul></ul><ul><ul><li>Use cross validation, e.g., 10-fold cross validation </li></ul></ul><ul><ul><li>Use all the data for training </li></ul></ul><ul><ul><ul><li>but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution </li></ul></ul></ul><ul><ul><li>Use minimum description length (MDL) principle: </li></ul></ul><ul><ul><ul><li>halting growth of the tree when the encoding is minimized </li></ul></ul></ul>
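The first two approaches on this slide depend on how the data are divided. The following sketch shows a 2/3 to 1/3 hold-out split and the index sets for k-fold cross validation; the function names, the fixed random seed, and the round-robin assignment of samples to folds are assumptions for illustration, and shuffling before building folds is left to the caller.

```python
import random


def holdout_split(rows, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of rows and split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


train, test = holdout_split(list(range(30)))
print(len(train), len(test))

for train_idx, test_idx in k_fold_indices(30, k=10):
    print(len(train_idx), test_idx)
```

In cross validation every sample is used for testing exactly once, so the averaged accuracy over the folds makes fuller use of a small data set than a single hold-out split.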
  39. 40. Decision Tree <ul><li>Enhancements to basic decision tree induction </li></ul><ul><ul><li>Allow for continuous-valued attributes </li></ul></ul><ul><ul><ul><li>Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals </li></ul></ul></ul><ul><ul><li>Handle missing attribute values </li></ul></ul><ul><ul><ul><li>Assign the most common value of the attribute </li></ul></ul></ul><ul><ul><ul><li>Assign probability to each of the possible values </li></ul></ul></ul><ul><ul><li>Attribute construction </li></ul></ul><ul><ul><ul><li>Create new attributes based on existing ones that are sparsely represented </li></ul></ul></ul><ul><ul><ul><li>This reduces fragmentation, repetition, and replication </li></ul></ul></ul>
