CSCI 548/B480:  Introduction to Bioinformatics Fall 2002 Dr. Jeffrey Huang, Assistant Professor Department of Computer and Information Science, IUPUI E-mail: huang@cs.iupui.edu Topic 5: Machine Intelligence - Learning and Evolution
Machine Intelligence
Machine Learning
  - The subfield of AI concerned with intelligent systems that learn.
  - The computational study of algorithms that improve performance based on experience.
  - The attempt to build intelligent entities: we must understand intelligent entities first.
      - Computational Brain
  - Mathematics: philosophy staked out most of the ideas of AI, but making it a formal science requires mathematical formalization in:
      - Computation
      - Logic
      - Probability
Behavior-Based AI vs. Knowledge-Based
Definitions of Machine Learning
  Reasoning
    - The effort to make computers think and solve problems
    - The study of mental faculties through the use of computational models
  Behavior
    - Making machines perform human actions that require intelligence
    - Seeking to explain intelligent behavior in terms of computational processes
  Agents
    [Figure: an agent receives percepts from its environment through sensors and acts on the environment through effectors.]
Operational Agents
Operational Views of Intelligence:
  - The ability to perform intellectual tasks
      - Prove theorems, play chess, solve puzzles
      - Focus on what goes on “between the ears”
      - Emphasize the ability to build and effectively use mental models
  - The ability to perform intellectually challenging “real world” tasks
      - Medical diagnosis, tax advising, financial investing
      - Introduce new issues such as critical interactions with the world, model grounding, uncertainty
  - The ability to survive, adapt, and function in a constantly changing world
      - Autonomous agents
      - Vision, locomotion, and manipulation, … many I/O issues
      - Self-assessment, learning, curiosity, etc.
Building Intelligent Artifacts
  - Symbolic approaches:
      - Construct goal-oriented symbol manipulation systems
      - Focus on high-end abstract thinking
  - Non-symbolic approaches:
      - Build performance-oriented systems
      - Focus on behavior
  - Both are needed, in tightly coupled form
      - Such systems are difficult to build
      - Growing need to automate this process
  - Good approach: Evolutionary Algorithms
Behavior-Based AI
  - Behavior-Based AI vs. Knowledge-Based
      - "Situated" in environment
      - Multiple competencies ('routines')
      - Autonomy
      - Adaptation and Competition
  - Artificial Life (A-Life)
      - Agents: Reactive Behavior
      - Abstracting the logical principles of living organisms
      - Collective Behavior: Competition and Cooperation
Classification vs. Prediction
  - Classification:
      - predicts categorical class labels
      - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
  - Prediction:
      - models continuous-valued functions, i.e., predicts unknown or missing values
Classification: A Two-Step Process
  1. Model construction: describing a set of predetermined classes
      - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
      - The set of tuples used for model construction is the training set
      - The model is represented as classification rules, decision trees, or mathematical formulae
  2. Model usage: classifying future or unknown objects
      - Estimate the accuracy of the model
          - The known label of each test sample is compared with the classified result from the model
          - The accuracy rate is the percentage of test set samples that are correctly classified by the model
          - The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
Classification Process
  - Model construction: Training Data -> Classification Algorithms -> Classifier (Model)
      - Example model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
  - Use the model in prediction: Testing Data and Unseen Data -> Classifier (Model)
      - Example unseen tuple: (Jeff, Professor, 2). Tenured?
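The two steps can be sketched in a few lines of Python. This is only an illustration: the rule is the one shown on the slide, but the test records, the names in them, and the helper names `predict_tenured` and `accuracy` are made up for the example.

```python
def predict_tenured(rank, years):
    """The slide's model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

def accuracy(model, labelled_samples):
    """Fraction of test samples whose known label matches the model's output."""
    correct = sum(1 for _, rank, years, label in labelled_samples
                  if model(rank, years) == label)
    return correct / len(labelled_samples)

# Hypothetical test set, kept separate from whatever data produced the rule:
# (name, rank, years, known label)
test_set = [
    ("Ann",   "Assistant Prof", 2, "no"),
    ("Bo",    "Associate Prof", 7, "no"),
    ("Carla", "Professor",      5, "yes"),
    ("Dev",   "Assistant Prof", 7, "yes"),
]

test_accuracy = accuracy(predict_tenured, test_set)   # 3 of 4 correct
unseen = predict_tenured("Professor", 2)              # the (Jeff, Professor, 2) tuple
```

The second record is misclassified by the rule, which is exactly the kind of error the accuracy estimate is meant to expose.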
Supervised vs. Unsupervised Learning
  - Supervised learning (classification)
      - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
      - New data is classified based on the training set
  - Unsupervised learning (clustering)
      - The class labels of the training data are unknown
      - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification and Prediction
Data Preparation
  - Data cleaning
      - Preprocess data in order to reduce noise and handle missing values
  - Relevance analysis (feature selection)
      - Remove the irrelevant or redundant attributes
  - Data transformation
      - Generalize and/or normalize data
Evaluating Classification Methods
  - Predictive accuracy
  - Speed and scalability
      - time to construct the model
      - time to use the model
  - Robustness: handling noise and missing values
  - Scalability: efficiency in disk-resident databases
  - Interpretability: understanding and insight provided by the model
  - Goodness of rules
      - decision tree size
      - compactness of classification rules
From Learning to Evolutionary Optimization
  - Accomplishing an abstract task = solving a problem = searching through a space of potential solutions
  - Finding the “best solution” is an optimization process
  - Classical exhaustive methods? Large space? A special machine learning technique is needed
  - Evolutionary Algorithms
      - Stochastic algorithms
      - Search methods that model some natural phenomena:
          - Genetic inheritance
          - Darwinian strife for survival
“… the metaphor underlying genetic algorithms is that of natural evolution. In evolution, the problem each species faces is one of searching for beneficial adaptations to a complicated and changing environment. The ‘knowledge’ that each species has gained is embodied in the makeup of chromosomes of its members” - L. Davis and M. Steenstrup, “Genetic Algorithms and Simulated Annealing”, pp. 1-11, Kaufmann, 1987
The Essential Components
  - A genetic representation for potential solutions to the problem
  - A way to create an initial population of potential solutions
  - An evaluation function that plays the role of the environment, rating solutions in terms of their “fitness”, i.e. the use of fitness to determine survival and reproductive rates
  - Genetic operators that alter the composition of children
Evolutionary Algorithm Search Procedure
  1. Randomly generate an initial population M(0)
  2. Compute and save the fitness u(m) for each individual m in the current population M(t)
  3. Define selection probabilities p(m) for each individual m in M(t) so that p(m) is proportional to u(m)
  4. Generate M(t+1) by probabilistically selecting individuals to produce offspring via genetic operations (crossover and mutation)
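The search procedure can be sketched as a short generic loop. This is an illustration under stated assumptions, not the course's implementation: the function names `evolve` and `blend`, the real-valued individuals, and the noise level are all hypothetical choices made for the example.

```python
import random

def evolve(population, fitness, make_offspring, generations, rng=random):
    """Run the select-and-reproduce loop and return the final population M(t)."""
    population = list(population)                      # M(0)
    for _ in range(generations):
        scores = [fitness(m) for m in population]      # u(m) for each m in M(t)
        # random.choices draws with probability proportional to the weights,
        # so p(m) is proportional to u(m). Scores must be non-negative
        # and must not all be zero.
        mothers = rng.choices(population, weights=scores, k=len(population))
        fathers = rng.choices(population, weights=scores, k=len(population))
        population = [make_offspring(a, b, rng)        # M(t+1)
                      for a, b in zip(mothers, fathers)]
    return population

def blend(a, b, rng):
    """Hypothetical operator for real-valued individuals in [0, 4]."""
    child = (a + b) / 2 + rng.gauss(0, 0.1)   # averaging as crossover, noise as mutation
    return min(4.0, max(0.0, child))          # keep the child inside the search interval

rng = random.Random(0)
start = [rng.uniform(0, 4) for _ in range(24)]
final = evolve(start, lambda x: x * x, blend, generations=30, rng=rng)
```

Passing the fitness function and the genetic operator as arguments keeps the loop independent of the representation, so the same skeleton works for bit strings or real values.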
Historical Background
  - Three paradigms emerged in the 1960s:
      - Genetic Algorithms
          - Introduced by Holland (University of Michigan); De Jong (GMU)
          - Envisioned for a broad range of “adaptive systems”
      - Evolution Strategies
          - Introduced by Rechenberg
          - Focused on real-valued parameter optimization
      - Evolutionary Programming
          - Introduced by Fogel (Koza later developed the related genetic programming)
          - Applied to AI and machine learning problems
  - Today:
      - A wide variety of evolutionary algorithms
      - Applied to many areas of science and engineering
Examples of Evolutionary AI
Parameter Tuning
  - Pervasiveness of parameterized models
  - Complex behavioral changes due to non-linear interactions
  - Examples:
      - Weights of an artificial neural network
      - Parameters of a heuristic evaluation function
      - Parameters of a rule induction system
      - Parameters of membership functions
  - Goal: evolve a useful set of discrete/continuous parameters over time
Evolving Structure
  - Effect behavior change via more complex structures
  - Examples:
      - Selecting/constructing the topology of ANNs
      - Selecting/constructing the feature sets
      - Selecting/constructing plans/scenarios
      - Selecting/constructing membership functions
  - Goal: evolve useful structure over time
Evolving Programs
  - Goal: acquire new behaviors and adapt existing ones
  - Examples:
      - Acquire/adapt behavioral rule sets
      - Acquire/adapt arm/joint control programs
      - Acquire/adapt task-oriented programming code
How Does a Genetic Algorithm Work?
  - A simple example of function optimization
      - Find max $f(x) = x^2$ for $x \in [0, 4]$
  - Representation:
      - Genotype (chromosome): internally, points in the search space are represented as (binary) strings over some alphabet
      - Phenotype: the expressed traits of an individual
      - A precision on the order of $10^{-4}$ for $x$ in $[0, 4]$ is taken to need 14 bits: $8{,}000 \approx 2^{13} < 10{,}000 < 2^{14} \approx 16{,}000$. The resulting step between neighboring values is $4/(2^{14}-1) \approx 2.4 \times 10^{-4}$
  - Simple fixed-length binary encoding
      - Assign 0.0 to the string 00 0000 0000 0000
      - Assign $0.0 + \mathrm{bin2dec}(\text{binary string}) \cdot 4/(2^{14}-1)$ to the string 00 0000 0000 0001, and so on
      - Phenotype 4.0 = genotype 11 1111 1111 1111
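The genotype-to-phenotype mapping above can be written directly. A minimal sketch, assuming chromosomes are stored as Python strings of '0' and '1'; the names `decode`, `N_BITS`, `LOW`, and `HIGH` are illustrative, with `int(s, 2)` playing the role of bin2dec.

```python
N_BITS = 14
LOW, HIGH = 0.0, 4.0

def decode(chromosome: str) -> float:
    """Map a 14-bit string to a point x in [0, 4]."""
    return LOW + int(chromosome, 2) * (HIGH - LOW) / (2 ** N_BITS - 1)

def f(x: float) -> float:
    """The function being maximized in the example."""
    return x * x

smallest = decode("00000000000000")   # 0.0
step = decode("00000000000001")       # 4 / (2**14 - 1)
largest = decode("11111111111111")    # 4.0
best_fitness = f(largest)             # 16.0, the maximum of x^2 on [0, 4]
```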
Initial population:
  - Create a population (pop_size) of chromosomes, where each chromosome is a binary vector of 14 bits
  - All 14 bits of each chromosome are initialized randomly
  Genotype          Phenotype
  00000000000000    0.0
  00000000000001    4/(2^14 - 1)
  …                 …
  11111111111111    4.0
  [Figure: the initial population of chromosomes v_1, v_2, …, v_24]
Evaluation function
  - The evaluation function eval for binary vectors v is equal to the function f: eval(v) = f(x)
  - e.g., eval(v_1) = f(x_1) = fitness_1
Parameters
  - pop_size = 24
  - Probability of crossover, p_c = 0.6
  - Probability of mutation, p_m = 0.01
Recombination: using genetic operations
  - Crossover (p_c), here with the crossover point after the 4th bit:
      v_1 = 0111 1100010011  =>  v_1' = 0111 0101011100
      v_2 = 0001 0101011100  =>  v_2' = 0001 1100010011
  - Mutation (p_m), here flipping the 7th bit:
      v_2' = 000111 0 0010011  =>  v_2'' = 000111 1 0010011
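The two operators can be reproduced on the slide's own chromosomes. A minimal sketch, assuming string chromosomes; the function names are illustrative, and `mutate` shows one common way to apply p_m independently to every bit, which the slide does not spell out.

```python
import random

def one_point_crossover(a, b, point):
    """Swap the tails of two equal-length chromosomes after position `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]

def flip_bit(chromosome, index):
    """Return a copy with the bit at `index` (0-based) inverted."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]

def mutate(chromosome, p_m, rng=random):
    """Flip each bit independently with probability p_m (an assumed scheme)."""
    return "".join(flip_bit(bit, 0) if rng.random() < p_m else bit
                   for bit in chromosome)

v1 = "01111100010011"
v2 = "00010101011100"
v1_child, v2_child = one_point_crossover(v1, v2, point=4)
v2_mutated = flip_bit(v2_child, index=6)   # the 7th bit, counting from 1
```

With these inputs `v1_child` is 01110101011100, `v2_child` is 00011100010011, and `v2_mutated` is 00011110010011, matching the slide.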
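Selection of M(t+1) from M(t): using a roulette wheel
  1. Total fitness of the population: $F = \sum_{i=1}^{pop\_size} eval(v_i)$
  2. Probability of selection for each chromosome $v_i$: $prob_i = eval(v_i) / F$
  3. Cumulative probability: $q_i = \sum_{j=1}^{i} prob_j$, with $q_0 = 0$
  4. Generate random numbers $r_j$ from [0, 1], where $j = 1, \dots, pop\_size$
  5. Select chromosome $v_i$ such that $q_{i-1} < r_j \le q_i$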
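Roulette-wheel selection can be sketched with a cumulative table and a binary search. This is an illustration, assuming non-negative fitness values with a positive total; `roulette_select` is a hypothetical name.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def roulette_select(population, fitnesses, rng=random):
    """Draw len(population) individuals with probability proportional to fitness."""
    total = sum(fitnesses)                                   # total fitness F
    cumulative = list(accumulate(f / total for f in fitnesses))   # q_1 .. q_n
    cumulative[-1] = 1.0     # guard against floating-point rounding in the last q
    selected = []
    for _ in population:
        r = rng.random()     # r in [0, 1)
        # bisect_left returns the first i with q_i >= r, i.e. q_(i-1) < r <= q_i
        selected.append(population[bisect_left(cumulative, r)])
    return selected

demo_rng = random.Random(0)
picked = roulette_select(["a", "b", "c"], [0.0, 0.0, 1.0], demo_rng)
```

In the demo all the fitness sits on the last individual, so it fills the new population; with more evenly spread fitness values, fitter chromosomes are simply drawn more often.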
 
Homing to the Optimal Solution
Best-so-far Curve
Optimal Feature Subset
Search for the Subsets of Discriminatory Features
  - A combinatorial optimization problem
  - Two general approaches to identifying optimal subsets of features:
      1. Abstract measurement of important properties of good feature sets
          - Orthogonality (e.g., PCA), information content, low variance
          - Less expensive process
          - Leads to suboptimal performance if the abstract measures do not correlate well with actual performance
      2. Building a classifier from the feature subset and evaluating its performance on actual classification tasks
          - Better classification performance
          - The cost of building and testing classifiers prohibits any kind of systematic evaluation of feature subsets
          - Suboptimal in practice: large numbers of candidate features cannot be handled by any form of systematic search
          - $2^N$ possible candidate subsets of $N$ features
Inductive Learning: Learning From Examples
Decision Tree (DT): Information Theory (IT)
  - Question: what are the BEST attributes (features) for building the decision tree?
  - Answer: the ‘BEST’ attribute is the one that is ‘MOST’ informative, i.e. the one that leaves the least ‘ambiguity/uncertainty’
  - Solution: measure information content using the expected amount of information provided by the attribute
Classification by Decision Tree Induction
  - Decision tree
      - A flow-chart-like tree structure
      - An internal node denotes a test on an attribute
      - A branch represents an outcome of the test
      - Leaf nodes represent class labels or class distributions
  - Decision tree generation consists of two phases
      - Tree construction
          - At start, all the training examples are at the root
          - Partition examples recursively based on selected attributes
      - Tree pruning
          - Identify and remove branches that reflect noise or outliers
  - Use of a decision tree: classifying an unknown sample
      - Test the attribute values of the sample against the decision tree

  Exs.  Size    Color   Surface  Class
  1     Small   Yellow  Smooth   A
  2     Medium  Red     Smooth   A
  3     Medium  Red     Smooth   A
  4     Big     Red     Rough    A
  5     Medium  Yellow  Smooth   B
  6     Medium  Yellow  Smooth   B

  color
    red    -> A
    yellow -> size
                small  -> A
                medium -> B
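Entropy
  - Define an entropy function H such that
      $H(p_1, \dots, p_k) = -\sum_{i=1}^{k} p_i \log_2 p_i$
    where $p_i$ is the probability associated with the $i$-th class
  - For a feature, the entropy is calculated for each value. The sum of these entropies, weighted by the probability of each value, is the entropy for that feature
  - Example: toss a fair coin
      $H(\tfrac{1}{2}, \tfrac{1}{2}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit
  - If the coin is not fair, e.g. $P_{heads} = 99\%$, then
      $H(0.99, 0.01) = -0.99\log_2 0.99 - 0.01\log_2 0.01 \approx 0.08$ bits
  - So, by tossing the coin you get very little (extra) information (that you didn’t expect)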
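  - In general, if you have $p$ positive examples and $n$ negative examples:
      $H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
  - For $p = n$, $H = 1$: originally there is the most uncertainty about the eventual outcome (picking an example) and the most to gain by picking the example.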
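The coin examples and the p = n case can be checked numerically. A minimal sketch using the standard Shannon entropy in bits; the function names `entropy` and `binary_entropy` are illustrative.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def binary_entropy(p, n):
    """Entropy of a set holding p positive and n negative examples."""
    total = p + n
    return entropy([p / total, n / total])

fair = entropy([0.5, 0.5])        # 1 bit
biased = entropy([0.99, 0.01])    # about 0.08 bits
balanced = binary_entropy(3, 3)   # p = n gives 1 bit
```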
Decision Tree Induction
  - Basic algorithm (a greedy algorithm)
      - The tree is constructed in a top-down recursive divide-and-conquer manner
      - At start, all the training examples are at the root
      - Attributes are categorical (if continuous-valued, they are discretized in advance)
      - Examples are partitioned recursively based on selected attributes
      - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  - Conditions for stopping partitioning
      - All samples for a given node belong to the same class
      - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
      - There are no samples left
Algorithm
  1. Select a random subset W (called the window) from the training set T
  2. Build a DT for the current W
      - Select the best feature, the one which minimizes the entropy H (or maximizes the gain)
      - Categorize training instances (examples) into subsets by this feature
      - Repeat this process recursively until each subset contains instances of one kind (class) or some statistical criterion is satisfied
  3. Scan the entire training set for exceptions to the DT
  4. If exceptions are found, insert some of them into W and repeat from step 2
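Information Gain
  - The information gain from the test on attribute $A$ is defined as the difference between the original information requirement and the new requirement:
      $Gain(A) = H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - Remainder(A)$
      $Remainder(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, H\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)$
    where $A$ splits the examples into $v$ subsets, the $i$-th containing $p_i$ positive and $n_i$ negative examples
  - Note that $Remainder(A)$ is a weighted (by attribute values) entropy function
  - Maximizing $Gain(A)$ is equivalent to minimizing $Remainder(A)$; the attribute $A$ that does so is the most informative attribute (‘question’)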
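The gain can be computed from class counts alone. The sketch below applies it to the six-example Size/Color/Surface table used in this topic, treating class A as positive and class B as negative; that labeling and the names `H`, `remainder`, `gain`, and `splits` are assumptions made for the example.

```python
from math import log2

def H(p, n):
    """Entropy in bits of a set with p positive and n negative examples."""
    total = p + n
    return -sum(k / total * log2(k / total) for k in (p, n) if k > 0)

def remainder(subsets):
    """Entropy left after the split, weighted by subset size."""
    total = sum(p_i + n_i for p_i, n_i in subsets)
    return sum((p_i + n_i) / total * H(p_i, n_i) for p_i, n_i in subsets)

def gain(subsets):
    """Information gain of a split given as a list of (p_i, n_i) counts."""
    p = sum(p_i for p_i, _ in subsets)
    n = sum(n_i for _, n_i in subsets)
    return H(p, n) - remainder(subsets)

# (A count, B count) per attribute value, read off the example table
splits = {
    "color":   [(1, 2), (3, 0)],            # yellow, red
    "size":    [(1, 0), (2, 2), (1, 0)],    # small, medium, big
    "surface": [(3, 2), (1, 0)],            # smooth, rough
}
gains = {attribute: gain(counts) for attribute, counts in splits.items()}
best = max(gains, key=gains.get)
```

The gains come out at roughly 0.46 for color, 0.25 for size, and 0.11 for surface, so color is chosen as the root test, in line with the tree shown for this example.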
The ID3 Algorithm and Quinlan’s C4.5
  - C4.5 Tutorial: http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
  - Matlab program: http://www.cs.wisc.edu/~olvi/uwmp/msmt.html
  - See5/C5.0 Tutorial: http://borba.ncc.up.pt/niaad/Software/c50/c50manual.html
  - Software for Win2000: http://www.rulequest.com/download.html
Example:

  Exs.  Size    Color   Surface  Class
  1     Small   Yellow  Smooth   A
  2     Medium  Red     Smooth   A
  3     Medium  Red     Smooth   A
  4     Big     Red     Rough    A
  5     Medium  Yellow  Smooth   B
  6     Medium  Yellow  Smooth   B

  After the first split:
  color
    red    -> A
    yellow -> ?

  Final tree:
  color
    red    -> A
    yellow -> size
                small  -> A
                medium -> B
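The example can be reproduced with a small recursive tree builder. This is a sketch of greedy, gain-based induction in the spirit of ID3, not a full implementation: it has no pruning, no windowing, and no handling of attribute values missing from a subset. The row encoding and function names are assumptions for illustration.

```python
from collections import Counter
from math import log2

ROWS = [
    {"size": "small",  "color": "yellow", "surface": "smooth", "class": "A"},
    {"size": "medium", "color": "red",    "surface": "smooth", "class": "A"},
    {"size": "medium", "color": "red",    "surface": "smooth", "class": "A"},
    {"size": "big",    "color": "red",    "surface": "rough",  "class": "A"},
    {"size": "medium", "color": "yellow", "surface": "smooth", "class": "B"},
    {"size": "medium", "color": "yellow", "surface": "smooth", "class": "B"},
]

def entropy(rows):
    """Class entropy of a list of rows, in bits."""
    counts = Counter(row["class"] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def split(rows, attribute):
    """Group rows by their value for the given attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups

def gain(rows, attribute):
    """Information gain of splitting rows on the attribute."""
    groups = split(rows, attribute).values()
    return entropy(rows) - sum(len(g) / len(rows) * entropy(g) for g in groups)

def build_tree(rows, attributes):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    classes = Counter(row["class"] for row in rows)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]      # pure node, or majority vote
    best = max(attributes, key=lambda a: gain(rows, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: build_tree(group, rest)
                   for value, group in split(rows, best).items()}}

tree = build_tree(ROWS, ["size", "color", "surface"])
```

On this data the builder splits on color first, then on size within the yellow branch, giving the same tree as the final one in the example.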
Noise and Overfitting
  - Question: what about two or more examples with the same description but different classifications?
  - Answer: each leaf node reports either the MAJORITY classification or the relative frequencies
  - Question: what about irrelevant attributes (noise and overfitting)?
  - Answer: tree pruning
  - Solution: an information gain close to zero is a good clue to irrelevance. Compare the actual numbers of (+) and (-) examples in each subset $i$, $p_i$ and $n_i$, with the expected numbers $\hat{p}_i$ and $\hat{n}_i$ assuming true irrelevance:
      $\hat{p}_i = p \cdot \frac{p_i + n_i}{p + n}, \qquad \hat{n}_i = n \cdot \frac{p_i + n_i}{p + n}$
    where $p$ and $n$ are the total numbers of positive and negative examples to start with
  - Total deviation (to judge statistical significance):
      $D = \sum_{i=1}^{v} \left[ \frac{(p_i - \hat{p}_i)^2}{\hat{p}_i} + \frac{(n_i - \hat{n}_i)^2}{\hat{n}_i} \right]$
  - Under the null hypothesis, $D$ follows a chi-squared distribution (with $v - 1$ degrees of freedom)
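The deviation statistic is straightforward to compute from the per-subset counts. A minimal sketch, assuming both classes occur at least once and no subset is empty; the function name `deviation` and the example counts (the color split of the six-example table, with class A taken as positive) are illustrative. The constant 3.841 is the standard 5% critical value of the chi-squared distribution with one degree of freedom.

```python
def deviation(subsets):
    """Total deviation D for a split given as a list of (p_i, n_i) counts."""
    p = sum(p_i for p_i, _ in subsets)
    n = sum(n_i for _, n_i in subsets)
    d = 0.0
    for p_i, n_i in subsets:
        share = (p_i + n_i) / (p + n)
        expected_p, expected_n = p * share, n * share   # counts under irrelevance
        d += (p_i - expected_p) ** 2 / expected_p
        d += (n_i - expected_n) ** 2 / expected_n
    return d

CHI2_5PCT_1DF = 3.841   # critical value for v - 1 = 1 degree of freedom

d_color = deviation([(1, 2), (3, 0)])       # yellow, red
d_irrelevant = deviation([(2, 1), (2, 1)])  # a split that changes nothing
significant = d_color > CHI2_5PCT_1DF
```

Here `d_color` is 3.0, which stays below the 5% threshold: with only six examples even the best attribute is not statistically significant, which illustrates why the test is meant for larger samples.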
Extracting Classification Rules from Trees
  - Represent the knowledge in the form of IF-THEN rules
  - One rule is created for each path from the root to a leaf
  - Each attribute-value pair along a path forms a conjunction
  - The leaf node holds the class prediction
  - Rules are easier for humans to understand
  - Example
      IF age = “<=30” AND student = “no” THEN buys_computer = “no”
      IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
      IF age = “31…40” THEN buys_computer = “yes”
      IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
      IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
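Rule extraction amounts to walking every root-to-leaf path. The sketch below assumes a tree stored as nested dictionaries of the form {attribute: {value: subtree}}, which is a representation chosen for this example; the tree itself is reconstructed from the five rules listed above.

```python
def extract_rules(tree, target, conditions=()):
    """Return one IF-THEN rule string per root-to-leaf path."""
    if not isinstance(tree, dict):                     # leaf: holds the class prediction
        body = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {body} THEN {target} = "{tree}"']
    (attribute, branches), = tree.items()              # exactly one test per internal node
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, target,
                                   conditions + ((attribute, value),)))
    return rules

buys_computer_tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}},
    }
}

rules = extract_rules(buys_computer_tree, "buys_computer")
```

Each attribute-value pair collected on the way down becomes one term of the conjunction, so the five paths of this tree yield the five rules of the example (with plain rather than typographic quotes).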
Decision Tree
Avoid Overfitting in Classification
  - The generated tree may overfit the training data
      - Too many branches, some of which may reflect anomalies due to noise or outliers
      - The result is poor accuracy for unseen samples
  - Two approaches to avoid overfitting
      - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
          - It is difficult to choose an appropriate threshold
      - Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
          - Use a set of data different from the training data to decide which is the “best pruned tree”
Approaches to Determine the Final Tree Size
  - Separate training (2/3) and testing (1/3) sets
  - Use cross validation, e.g., 10-fold cross validation
  - Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
  - Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
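The bookkeeping behind k-fold cross validation can be sketched without any library. This is an illustration only: the function name `k_fold_splits` and the interleaved fold assignment are assumptions, and in practice the samples would usually be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices); every sample is tested exactly once."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

splits = list(k_fold_splits(n_samples=30, k=10))
first_train, first_test = splits[0]
```

For each of the ten splits a tree would be grown on the training indices and scored on the held-out indices; averaging the ten scores gives the estimate used to compare tree sizes.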
Decision Tree
Enhancements to basic decision tree induction
  - Allow for continuous-valued attributes
      - Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
  - Handle missing attribute values
      - Assign the most common value of the attribute
      - Assign a probability to each of the possible values
  - Attribute construction
      - Create new attributes based on existing ones that are sparsely represented
      - This reduces fragmentation, repetition, and replication
