Topic_6 Presentation Transcript

  • CSCI 548/B480: Introduction to Bioinformatics Fall 2002 Dr. Jeffrey Huang, Assistant Professor Department of Computer and Information Science, IUPUI E-mail: Topic 5: Machine Intelligence - Learning and Evolution
  • Machine Intelligence
    • Machine Learning
      • The subfield of AI concerned with intelligent systems that learn.
      • The computational study of algorithms that improve performance based on experience.
    • The attempt to build intelligent entities:
      • We must understand intelligent entities first
      • Computational Brain
      • Mathematics:
        • Philosophy staked out most of the ideas of AI, but making it a formal science requires mathematical formalization in
          • Computation
          • Logic
          • Probability
  • Behavior-Based AI vs. Knowledge Based
    • Definitions of Machine Learning
      • Reasoning
        • The effort to make computers think and solve problems
        • The study of mental faculties through the use of computational models
      • Behavior
        • Making machines perform human actions that require intelligence
        • Seeks to explain intelligent behavior in terms of computational processes
    • Agents
      • [Figure: an agent and its environment; the agent receives percepts through its sensors and acts on the environment through its effectors]
  • Operational Agents
    • Operational Views of Intelligence:
      • The ability to perform intellectual tasks
        • Prove theorems, play chess, solve puzzles
        • Focus on what goes on “between the ears”
        • Emphasize the ability to build and effectively use mental models
      • The ability to perform intellectually challenging “real world” tasks
        • Medical diagnosis, tax advising, financial investing
        • Introduce new issues such as: critical interactions with the world, model grounding, uncertainty
      • The ability to survive, adapt, and function in a constantly changing world
        • Autonomous agents
        • Vision, locomotion, and manipulation,… many I/O issues
        • Self-assessment, learning, curiosity, etc.
  • Building Intelligent Artifacts
    • Symbolic Approaches:
      • Construct goal-oriented symbol manipulation systems
      • Focus on high end abstract thinking
    • Non-symbolic approaches:
      • Build performance-oriented systems
      • Focus on behavior
    • Need both in tightly coupled form
      • Such systems are difficult to build
      • Growing need to automate this process
      • Good approach: Evolutionary Algorithms
    • Behavior-Based AI
      • Behavior-Based AI vs. Knowledge-Based
      • "Situated" in environment
      • Multiple competencies ('routines')
      • Autonomy
      • Adaptation and Competition
    • Artificial Life (A-Life)
      • Agents: Reactive Behavior
      • Abstracting the logical principles of living organisms
      • Collective Behavior : Competition and Cooperation
  • Classification vs. Prediction
    • Classification:
      • predicts categorical class labels
      • constructs a model from the training set and the values (class labels) of a classifying attribute, then uses it to classify new data
    • Prediction:
      • models continuous-valued functions, i.e., predicts unknown or missing values
  • Classification—A Two-Step Process
    • Model construction: describing a set of predetermined classes
      • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
      • The set of tuples used for model construction: training set
      • The model is represented as classification rules, decision trees, or mathematical formulae
    • Model usage: for classifying future or unknown objects
      • Estimate accuracy of the model
        • The known label of each test sample is compared with the classified result from the model
        • Accuracy rate is the percentage of test set samples that are correctly classified by the model
        • The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
  • Classification Process
    • [Figure: model construction; the training data are fed to a classification algorithm, which produces the classifier (model), e.g. IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’]
    • [Figure: using the model in prediction; the classifier is checked on testing data and then applied to unseen data, e.g. (Jeff, Professor, 2): Tenured?]
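As a minimal sketch of the two steps, the rule from the figure can be written as a function and scored on a labeled test set. The function names and the four test samples below are hypothetical and are not part of the lecture; only the rule itself and the unseen sample (Jeff, Professor, 2) come from the slide.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Apply the slide's rule: rank is professor OR more than 6 years."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(test_set) -> float:
    """Fraction of labeled test samples that the rule classifies correctly."""
    correct = sum(
        predict_tenured(rank, years) == label
        for _name, rank, years, label in test_set
    )
    return correct / len(test_set)


# Hypothetical test samples (name, rank, years, tenured); illustration only.
test_set = [
    ("Ann", "Assistant Prof", 2, "no"),
    ("Bob", "Associate Prof", 7, "yes"),
    ("Cat", "Professor", 5, "yes"),
    ("Dan", "Assistant Prof", 7, "no"),  # the rule misclassifies this sample
]

jeff = predict_tenured("Professor", 2)   # the unseen sample from the figure
test_accuracy = accuracy(test_set)
print(jeff, test_accuracy)
```

The test samples are kept separate from whatever data produced the rule, which is what makes the accuracy figure meaningful.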
  • Supervised vs. Unsupervised Learning
    • Supervised learning (classification)
      • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
      • New data is classified based on the training set
    • Unsupervised learning (clustering)
      • The class labels of the training data are unknown
      • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
  • Classification and Prediction
    • Data Preparation
      • Data cleaning
        • Preprocess data in order to reduce noise and handle missing values
      • Relevance analysis (feature selection)
        • Remove the irrelevant or redundant attributes
      • Data transformation
        • Generalize and/or normalize data
    • Evaluating Classification Methods
      • Predictive accuracy
      • Speed and scalability
        • time to construct the model
        • time to use the model
      • Robustness: handling noise and missing values
      • Scalability: efficiency in disk-resident databases
      • Interpretability: understanding and insight provided by the model
      • Goodness of rules
        • decision tree size
        • compactness of classification rules
  • From Learning to Evolutionary
    • Optimization
      • Accomplishing an abstract task = solving a problem
      • = searching through a space of potential solutions
      • finding the “best solution”
        • ⇒ an optimization process
      • Classical exhaustive methods?
      • Large space? Special machine learning techniques are needed
    • Evolutionary Algorithms
      • Stochastic Algorithms
      • Search methods model some phenomena:
        • Genetic Inheritance
        • Darwinian strife for survival
    • “… the metaphor underlying genetic algorithms is that of natural evolution. In evolution, the problem each species faces is one of searching for beneficial adaptations to a complicated and changing environment. The ‘knowledge’ that each species has gained is embodied in the makeup of chromosomes of its members”
      • - L. Davis and M. Steenstrup, “Genetic Algorithms and Simulated Annealing”, pp. 1-11, Morgan Kaufmann, 1987
  • The Essential Components
      • A genetic representation for potential solutions to the problem
      • A way to create an initial population of potential solutions
      • An evaluation function that plays the role of the environment, rating solutions in terms of their “fitness”
        • i.e. the use of fitness to determine survival and reproductive rates
      • Genetic operators that alter the composition of children
  • Evolutionary Algorithm Search Procedure
    • Randomly generate an initial population M(0)
    • Compute and save the fitness u(m) for each individual m in the current population M(t)
    • Define selection probabilities p(m) for each individual m in M(t) so that p(m) is proportional to u(m)
    • Generate M(t+1) by probabilistically selecting individuals to produce offspring via genetic operations (crossover and mutation)
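The search procedure can be sketched as a short generic loop. This is an illustrative skeleton rather than the lecture's implementation: the function `evolve`, its parameters, and the toy "OneMax" problem (maximize the number of 1 bits) are assumptions introduced here, and fitness values are assumed to be non-negative so they can serve as selection weights.

```python
import random


def evolve(fitness, random_individual, crossover, mutate,
           pop_size=20, generations=50, seed=0):
    """Fitness-proportional evolutionary search; returns the best individual seen."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]  # M(0)
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(m) for m in population]                    # u(m)
        # p(m) proportional to u(m); the tiny offset avoids all-zero weights.
        weights = [s + 1e-9 for s in scores]
        next_population = []
        while len(next_population) < pop_size:
            mom, dad = rng.choices(population, weights=weights, k=2)
            next_population.append(mutate(crossover(mom, dad, rng), rng))
        population = next_population                                 # M(t+1)
        best = max(population + [best], key=fitness)
    return best


# Hypothetical toy problem: bit lists of length N_BITS, fitness = number of 1s.
N_BITS = 16


def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(N_BITS)]


def one_point(a, b, rng):
    cut = rng.randrange(1, N_BITS)
    return a[:cut] + b[cut:]


def flip(bits, rng, rate=0.02):
    return [1 - b if rng.random() < rate else b for b in bits]


best = evolve(sum, random_bits, one_point, flip)
print(best, sum(best))
```

Passing the representation and the operators in as functions keeps the loop independent of the problem, which mirrors the separation between the search procedure and the components listed on the previous slide.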
  • Historical Background
    • Three paradigms emerged in the 1960s:
      • Genetic Algorithms
        • Introduced by Holland (University of Michigan) and developed further by De Jong (GMU)
        • Envisioned for broad range of “adaptive systems”
      • Evolution Strategies
        • Introduced by Rechenberg
        • Focused on real-valued parameter optimization
      • Evolutionary Programming
        • Introduced by Fogel; Koza later developed the related genetic programming
        • Applied to AI and machine learning problems
    • Today:
      • Wide variety of evolutionary algorithms
      • Applied to many areas of science and engineering
  • Examples of Evolutionary AI
    • Parameter Tuning
      • Pervasiveness of parameterized models
      • Complex behavioral changes due to non-linear interactions
      • Example:
        • Weights of an artificial neural network
        • Parameters of a heuristic evaluation function
        • Parameters of a rule induction system
        • Parameters of membership functions
      • Goal: evolve a useful set of discrete/continuous parameters over time
    • Evolving Structure
      • Effect behavior change via more complex structures
      • Example:
        • Selecting/constructing the topology of ANNs
        • Selecting/constructing the feature sets
        • Selecting/constructing plans/scenarios
        • Selecting/constructing membership functions
      • Goal: evolve useful structure over time
    • Evolving Programs
      • Goal: acquire new behaviors and adapt existing ones
      • Example:
        • Acquire/adapt behavioral rule sets
        • Acquire/adapt arm/joint control programs
        • Acquire/adapt task-oriented programming code
  • How Does a Genetic Algorithm Work?
    • A simple example of function optimization
      • Find max $f(x) = x^2$ for $x \in [0, 4]$
      • Representation:
        • Genotype (chromosome): internally, points in the search space are represented as (binary) strings over some alphabet
        • Phenotype: the expressed traits of an individual
        • Dividing [0, 4] into at least $10^4$ equal steps needs 14 bits, which gives a resolution of $4/(2^{14}-1) \approx 2.4 \times 10^{-4}$
          • $2^{13} = 8{,}192 < 10{,}000 < 2^{14} = 16{,}384$
        • Simple fixed-length binary encoding
          • Assign 0.0 to the string 00 0000 0000 0000
          • Assign 0.0 + bin2dec(binary string) × $4/(2^{14}-1)$ to every other string: $4/(2^{14}-1)$ to the string 00 0000 0000 0001, and so on
          • Phenotype 4.0 = genotype 11 1111 1111 1111
      • Initial population:
        • Create a population (pop_size) of chromosomes, where each chromosome is a binary vector of 14 bits
        • All 14 bits for each chromosome are initialized randomly
      • Evaluation function
        • The evaluation function eval for binary vectors $v$ is equal to the function $f$:
        • $eval(v) = f(x)$
        • e.g. $eval(v_1) = f(x_1) = fitness_1$
      • [Figure: genotype-to-phenotype mapping; 00000000000000 → 0.0, 00000000000001 → $4/(2^{14}-1)$, …, 11111111111111 → 4.0; the population consists of the chromosomes $v_1, v_2, \ldots, v_{24}$]
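The encoding and the evaluation function can be sketched as follows. The names `decode` and `evaluate` and the constant `N_POINTS` are illustrative assumptions; the grid size of 10,000 points is taken from the slide's inequality $2^{13} < 10{,}000 < 2^{14}$.

```python
import math

LOW, HIGH = 0.0, 4.0
N_POINTS = 10_000                          # grid size used in the slide's inequality
N_BITS = math.ceil(math.log2(N_POINTS))    # smallest number of bits with 2**bits >= N_POINTS


def decode(chromosome: str) -> float:
    """Map a binary string (genotype) to a real x in [LOW, HIGH] (phenotype)."""
    return LOW + int(chromosome, 2) * (HIGH - LOW) / (2 ** N_BITS - 1)


def evaluate(chromosome: str) -> float:
    """eval(v) = f(x) with f(x) = x**2."""
    return decode(chromosome) ** 2


print(N_BITS)
print(decode("0" * N_BITS), decode("1" * N_BITS))
print(evaluate("1" * N_BITS))
```

Python's built-in `int(s, 2)` plays the role of `bin2dec` on the slide.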
      • Parameters
        • pop_size = 24
        • Probability of crossover, $p_c = 0.6$
        • Probability of mutation, $p_m = 0.01$
      • Recombination: using genetic operations
        • Crossover ($p_c$), here with the crossover point after the 4th bit
        • $v_1$ = 0111 1100010011 => $v_1'$ = 0111 0101011100
        • $v_2$ = 0001 0101011100 => $v_2'$ = 0001 1100010011
        • Mutation ($p_m$), here flipping the 7th bit
        • $v_2'$ = 000111 0 0010011 => $v_2''$ = 000111 1 0010011
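The two operators can be sketched on bit strings and checked against the slide's example. The function names are illustrative; in a full run, a pair would undergo crossover with probability $p_c$ and each bit would be flipped with probability $p_m$, whereas the example below forces the crossover point and the mutated bit to match the slide.

```python
import random


def crossover(a: str, b: str, point: int):
    """One-point crossover: swap the tails of two chromosomes after `point` bits."""
    return a[:point] + b[point:], b[:point] + a[point:]


def flip_bit(chromosome: str, index: int) -> str:
    """Flip the bit at a 0-based index."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]


def mutate(chromosome: str, p_m: float, rng: random.Random) -> str:
    """Flip each bit independently with probability p_m."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < p_m else bit
        for bit in chromosome
    )


v1, v2 = "01111100010011", "00010101011100"
v1_new, v2_new = crossover(v1, v2, point=4)   # crossover point after the 4th bit
v2_mutated = flip_bit(v2_new, 6)              # 7th bit, i.e. index 6
print(v1_new, v2_new, v2_mutated)
```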
      • Selection of M(t+1) from M(t): using a roulette wheel
        • Total fitness of the population: $F = \sum_{i=1}^{pop\_size} eval(v_i)$
        • Probability of selection $prob_i$ for each chromosome $v_i$: $prob_i = eval(v_i) / F$
        • Cumulative probability: $q_i = \sum_{k=1}^{i} prob_k$
        • Generate random numbers $r_j$ from [0, 1], where $j = 1, \ldots, pop\_size$
        • Select chromosome $v_i$ such that $q_{i-1} < r_j \le q_i$
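A minimal sketch of roulette wheel selection follows, assuming non-negative fitness values with a positive total. The function name `roulette_select` is an assumption; the cumulative probabilities are searched with the standard-library `bisect_left`, which returns the first index $i$ with $q_i \ge r$, matching the condition $q_{i-1} < r \le q_i$.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def roulette_select(population, fitnesses, rng):
    """Draw len(population) individuals with probability proportional to fitness."""
    total = sum(fitnesses)                       # total fitness F
    probs = [f / total for f in fitnesses]       # prob_i
    cumulative = list(accumulate(probs))         # q_i
    cumulative[-1] = 1.0                         # guard against rounding error
    selected = []
    for _ in range(len(population)):
        r = rng.random()                         # r_j drawn from [0, 1)
        selected.append(population[bisect_left(cumulative, r)])
    return selected


rng = random.Random(1)
population = ["a", "b", "c", "d"]
fitnesses = [1.0, 0.0, 3.0, 0.0]
print(roulette_select(population, fitnesses, rng))
```

Individuals with zero fitness occupy an empty slice of the wheel, so "b" and "d" are not selected in this example.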
  • Homing to the Optimal Solution
  • Best-so-far Curve
  • Optimal Feature Subset
    • Search for the Subsets of Discriminatory Features
      • Combinatorial optimization problem
      • Two general approaches to identifying optimal subsets of features:
        • Abstract measures of important properties of good feature sets
          • Orthogonality (e.g. PCA), information content, low variance
          • Less expensive process
          • Can lead to suboptimal performance if the abstract measures do not correlate well with actual performance
        • Building a classifier from the feature subset and evaluating its performance on actual classification tasks
          • Better classification performance
          • The cost of building and testing classifiers prohibits any kind of systematic evaluation of feature subsets
            • Suboptimal in practice: large numbers of candidate features cannot be handled by any form of systematic search
            • There are $2^N$ possible candidate subsets of N features
  • Inductive Learning
    • Learning From Examples
      • Decision Tree (DT):
      • Information Theory (IT)
      • Question: what are the BEST attributes (Features) for building the decision tree?
      • Answer: the ‘BEST’ attribute is the one that is ‘MOST’ informative, i.e. the one for which ‘ambiguity/uncertainty’ is least
      • Solution: measure information content using the expected amount of information provided by the attribute
  • Classification by Decision Tree Induction
    • Decision tree
      • A flow-chart-like tree structure
      • Internal node denotes a test on an attribute
      • Branch represents an outcome of the test
      • Leaf nodes represent class labels or class distributions
    • Decision tree generation consists of two phases
      • Tree construction
        • At start, all the training examples are at the root
        • Partition examples recursively based on selected attributes
      • Tree pruning
        • Identify and remove branches that reflect noise or outliers
    • Use of decision tree: Classifying an unknown sample
      • Test the attribute values of the sample against the decision tree
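    • Training examples:
        Exs.  Class  Size    Color   Surface
        1     A      Small   Yellow  Smooth
        2     A      Medium  Red     Smooth
        3     A      Medium  Red     Smooth
        4     A      Big     Red     Rough
        5     B      Medium  Yellow  Smooth
        6     B      Medium  Yellow  Smooth
    • [Figure: resulting decision tree; the root tests color, with red → A; the yellow branch tests size, with small → A and medium → B]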
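    • Entropy
      • Define an entropy function H such that $H(p_1, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log_2 p_i$
      • where $p_i$ is the probability associated with the $i$-th class
      • For a feature, the entropy is calculated for each value.
      • The sum of these entropies, weighted by the probability of each value, is the entropy for that feature
      • Example: toss a fair coin: $H(\tfrac{1}{2}, \tfrac{1}{2}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit
      • If the coin is not fair, e.g. $P_{heads} = 99\%$, then $H(0.99, 0.01) = -0.99\log_2 0.99 - 0.01\log_2 0.01 \approx 0.08$ bits
      • So, by tossing the coin you get very little (extra) information (that you didn’t expect)
      • In general, if you have p positive examples and n negative examples: $H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
        • For $p = n$, $H = 1$
        • i.e. originally there is the most uncertainty about the eventual outcome (picking an example) and the most to gain by picking the example.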
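The coin examples can be checked numerically with a short sketch. It assumes the standard Shannon entropy in bits, $H = -\sum_i p_i \log_2 p_i$; the function names are illustrative.

```python
import math


def entropy(probabilities) -> float:
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def binary_entropy(p: int, n: int) -> float:
    """Entropy of a set with p positive and n negative examples."""
    total = p + n
    return entropy([p / total, n / total])


fair = entropy([0.5, 0.5])       # fair coin
biased = entropy([0.99, 0.01])   # coin with P(heads) = 99%
print(fair, round(biased, 4))
print(binary_entropy(3, 3))
```

The biased coin comes out at roughly 0.08 bits, which is the sense in which its toss tells you very little that you did not already expect.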
  • Decision Tree Induction
    • Basic algorithm (a greedy algorithm)
      • Tree is constructed in a top-down recursive divide-and-conquer manner
      • At start, all the training examples are at the root
      • Attributes are categorical (if continuous-valued, they are discretized in advance)
      • Examples are partitioned recursively based on selected attributes
      • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
    • Conditions for stopping partitioning
      • All samples for a given node belong to the same class
      • There are no remaining attributes for further partitioning; majority voting is then employed to classify the leaf
      • There are no samples left
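The greedy top-down procedure and its stopping conditions can be sketched as a small recursive function. This is an illustrative ID3-style sketch, not Quinlan's code: the nested-dict tree format and all function names are assumptions, and the data are the six training examples (Size, Color, Surface, Class) from the earlier slide. Because the sketch only creates branches for attribute values that occur in the data, the "no samples left" case shows up at classification time and is handled with a default.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    counts = Counter(labels)
    return sum(-c / len(labels) * math.log2(c / len(labels)) for c in counts.values())


def remainder(rows, attr, target) -> float:
    """Entropy left after splitting on attr, weighted by subset size."""
    total = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        total += len(subset) / len(rows) * entropy(subset)
    return total


def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:   # pure node, or no attributes left
        return majority                           # leaf (majority vote if impure)
    best = min(attributes, key=lambda a: remainder(rows, a, target))
    rest = [a for a in attributes if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value], rest, target)
                   for value in sorted({r[best] for r in rows})}}


def classify(tree, sample, default=None):
    while isinstance(tree, dict):
        (attr, branches), = tree.items()
        if sample[attr] not in branches:          # no training samples had this value
            return default
        tree = branches[sample[attr]]
    return tree


columns = ("size", "color", "surface", "class")
data = [dict(zip(columns, row)) for row in [
    ("small", "yellow", "smooth", "A"), ("medium", "red", "smooth", "A"),
    ("medium", "red", "smooth", "A"), ("big", "red", "rough", "A"),
    ("medium", "yellow", "smooth", "B"), ("medium", "yellow", "smooth", "B"),
]]
tree = build_tree(data, ["size", "color", "surface"], "class")
print(tree)
```

Minimizing the remainder is equivalent to maximizing the information gain, since the entropy before the split is the same for every candidate attribute.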
  • Algorithm
    • Select a random subset W (called the window) from the training set T
    • Build a DT for the current W
      • Select the best feature, i.e. the one that minimizes the entropy H (or maximizes the gain)
      • Categorize training instances (examples) into subsets by this feature
      • Repeat this process recursively until each subset contains instances of one kind (class) or some statistical criterion is satisfied
    • Scan the entire training set for exceptions to the DT
    • If exceptions are found, insert some of them into W and repeat from the second step (building a DT for the current W)
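    • Information Gain
      • The information gain from the test on attribute $A$ is defined as the difference between the original information requirement and the new requirement: $Gain(A) = H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - Remainder(A)$, where $Remainder(A) = \sum_i \frac{p_i+n_i}{p+n}\, H\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$ and $p_i$, $n_i$ are the numbers of positive and negative examples with the $i$-th value of $A$
        • Note that $Remainder(A)$ is an entropy function weighted by the attribute values
      • Maximizing $Gain(A)$ ⇔ minimizing $Remainder(A)$; $A$ is then the most informative attribute (‘question’)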
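As a worked check, the gain of each attribute can be computed directly from per-value counts. The sketch assumes the usual definition, gain = entropy before the split minus the weighted entropy (remainder) after it; the counts are read off the six training examples (Surface, Color, Size, Class) listed earlier, treating class A as positive and class B as negative, and the function names are illustrative.

```python
import math


def H(p: int, n: int) -> float:
    """Entropy (bits) of a set with p positive and n negative examples."""
    total = p + n
    return sum(-k / total * math.log2(k / total) for k in (p, n) if k > 0)


def remainder(splits) -> float:
    """Weighted entropy after a test; splits holds one (p_i, n_i) pair per value."""
    total = sum(p_i + n_i for p_i, n_i in splits)
    return sum((p_i + n_i) / total * H(p_i, n_i) for p_i, n_i in splits)


def gain(splits) -> float:
    p = sum(p_i for p_i, _ in splits)
    n = sum(n_i for _, n_i in splits)
    return H(p, n) - remainder(splits)


# (class A count, class B count) for each attribute value
splits = {
    "color":   [(1, 2), (3, 0)],           # yellow, red
    "size":    [(1, 0), (2, 2), (1, 0)],   # small, medium, big
    "surface": [(3, 2), (1, 0)],           # smooth, rough
}
gains = {attr: gain(s) for attr, s in splits.items()}
best_attribute = max(gains, key=gains.get)
print({attr: round(g, 3) for attr, g in gains.items()}, best_attribute)
```

Color has the largest gain (about 0.46 bits), which agrees with the tree in the figure testing color at the root.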
  • The ID3 Algorithm and Quinlan’s C4.5
    • C4.5
      • Tutorial:
      • Matlab program:
    • See5/C5.0
      • Tutorial:
      • Software for Win2000:
      • Example:
    • [Figure: the same six training examples (Exs. 1-6 with Surface, Color, Size, Class) and their decision tree; the root tests color, with red → A, and the yellow branch tests size, with small → A and medium → B. A second, partial tree shows the color test with red → A and the yellow branch still undecided (?)]
    • Noise and Overfitting
      • Question: what about two or more examples with the same description but different classifications?
      • Answer: Each leaf node reports either the MAJORITY classification or the relative frequencies
      • Question: what about irrelevant attributes (noise and overfitting)?
      • Answer: Tree pruning
      • Solution: An information gain close to zero is a good clue to irrelevance. Compare the actual numbers of (+) and (-) examples in each subset i, $p_i$ and $n_i$, with the expected numbers $\hat{p}_i$ and $\hat{n}_i$ assuming true irrelevance: $\hat{p}_i = p \cdot \frac{p_i+n_i}{p+n}$, $\hat{n}_i = n \cdot \frac{p_i+n_i}{p+n}$
      • where p and n are the total numbers of positive and negative examples to start with.
      • Total deviation (to assess statistical significance): $D = \sum_i \left[ \frac{(p_i-\hat{p}_i)^2}{\hat{p}_i} + \frac{(n_i-\hat{n}_i)^2}{\hat{n}_i} \right]$
      • Under the null hypothesis, D follows a chi-squared distribution
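The deviation statistic can be sketched as below, assuming the standard chi-squared form: expected counts under irrelevance are the overall class counts scaled by each subset's share of the data, and D sums the squared deviations divided by the expected counts. The function name is illustrative, and the sketch assumes both classes and every subset are non-empty. The first call uses the color split of the six training examples from the earlier slide (yellow: 1 A and 2 B; red: 3 A and 0 B).

```python
def total_deviation(splits) -> float:
    """splits: one (p_i, n_i) pair of observed positives/negatives per attribute value."""
    p = sum(p_i for p_i, _ in splits)
    n = sum(n_i for _, n_i in splits)
    d = 0.0
    for p_i, n_i in splits:
        size = p_i + n_i
        p_hat = p * size / (p + n)   # expected positives if the attribute is irrelevant
        n_hat = n * size / (p + n)   # expected negatives if the attribute is irrelevant
        d += (p_i - p_hat) ** 2 / p_hat + (n_i - n_hat) ** 2 / n_hat
    return d


d_color = total_deviation([(1, 2), (3, 0)])        # color: yellow, red
d_irrelevant = total_deviation([(2, 1), (2, 1)])   # a split that mirrors the overall ratio
print(d_color, d_irrelevant)
```

For a two-valued attribute the statistic has one degree of freedom, for which the 5% critical value is about 3.84; with only six examples the color split (D = 3.0) would therefore not be judged significant at that level, which illustrates how little evidence such a small sample carries.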
  • Extracting Classification Rules from Trees
    • Represent the knowledge in the form of IF-THEN rules
    • One rule is created for each path from the root to a leaf
    • Each attribute-value pair along a path forms a conjunction
    • The leaf node holds the class prediction
    • Rules are easier for humans to understand
    • Example
      • IF age = “<=30” AND student = “no” THEN buys_computer = “no”
      • IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
      • IF age = “31…40” THEN buys_computer = “yes”
      • IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
      • IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
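Rule extraction amounts to walking every root-to-leaf path. The sketch below assumes a nested-dict tree format (`{attribute: {value: subtree_or_label}}`), which is an illustrative choice; the tree itself is reconstructed from the five rules in the example.

```python
def extract_rules(tree, target, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):                 # leaf: holds the class prediction
        antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        yield f'IF {antecedent} THEN {target} = "{tree}"'
        return
    (attribute, branches), = tree.items()          # exactly one test per internal node
    for value, subtree in branches.items():
        yield from extract_rules(subtree, target, conditions + ((attribute, value),))


tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31…40": "yes",
    ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}},
}}
rules = list(extract_rules(tree, "buys_computer"))
for rule in rules:
    print(rule)
```

Each attribute-value pair on the path becomes one conjunct, so the number of rules equals the number of leaves.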
  • Decision Tree
    • Avoid Overfitting in Classification
      • The generated tree may overfit the training data
        • Too many branches, some may reflect anomalies due to noise or outliers
        • The result is poor accuracy on unseen samples
      • Two approaches to avoid overfitting
        • Prepruning: Halt tree construction early — do not split a node if this would result in the goodness measure falling below a threshold
          • Difficult to choose an appropriate threshold
        • Postpruning: Remove branches from a “fully grown” tree — get a sequence of progressively pruned trees
          • Use a set of data different from the training data to decide which is the “best pruned tree”
    • Approaches to Determine the Final Tree Size
      • Separate training (2/3) and testing (1/3) sets
      • Use cross validation, e.g., 10-fold cross validation
      • Use all the data for training
        • but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
      • Use minimum description length (MDL) principle:
        • halting growth of the tree when the encoding is minimized
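The first two approaches, a 2/3 to 1/3 holdout split and k-fold cross validation, can be sketched with plain lists. The function names and the stand-in data are assumptions; a real experiment would train and score a tree on each (train, test) pair.

```python
import random


def holdout_split(samples, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the samples and split it into training and testing parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold(samples, k=10):
    """Yield (train, test) pairs; each sample lands in exactly one test fold."""
    for i in range(k):
        test = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        yield train, test


samples = list(range(30))                 # stand-in for 30 labeled examples
train, test = holdout_split(samples)
folds = list(k_fold(samples, k=10))
print(len(train), len(test), len(folds))
```

In both schemes the pruning decision is judged on data that played no part in growing the tree.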
  • Decision Tree
    • Enhancements to basic decision tree induction
      • Allow for continuous-valued attributes
        • Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
      • Handle missing attribute values
        • Assign the most common value of the attribute
        • Assign probability to each of the possible values
      • Attribute construction
        • Create new attributes based on existing ones that are sparsely represented
        • This reduces fragmentation, repetition, and replication