CSCI 548/B480: Introduction to Bioinformatics, Fall 2002
Dr. Jeffrey Huang, Assistant Professor
Department of Computer and Information Science, IUPUI
E-mail: huang@cs.iupui.edu
Topic 5: Machine Intelligence - Learning and Evolution
Classification:
classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of each test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
Test set is independent of training set, otherwise over-fitting will occur
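The model-construction / model-usage protocol above can be sketched in a few lines of Python. The data and the trivial majority-class "model" are invented for illustration; the point is the separation of training set and test set and the accuracy computation:

```python
# Illustrative sketch (data and model are invented): build a model on a
# training set, then estimate accuracy on an independent test set.

def train_majority(training_set):
    """'Model construction': learn the majority class label."""
    labels = [label for _, label in training_set]
    return max(set(labels), key=labels.count)

def accuracy(model_label, test_set):
    """'Model usage': percentage of test samples classified correctly."""
    correct = sum(1 for _, label in test_set if label == model_label)
    return 100.0 * correct / len(test_set)

training_set = [("a", "yes"), ("b", "yes"), ("c", "no")]
test_set = [("d", "yes"), ("e", "no")]

model = train_majority(training_set)   # -> "yes"
print(accuracy(model, test_set))       # -> 50.0
```

Measuring accuracy on the training set instead of the held-out test set would overstate performance, which is the over-fitting point made above.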
[Figure: the classification process. Model construction: a classification algorithm is applied to the training data to produce a classifier (model), e.g. the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes'. Model usage: the classifier is checked against testing data and then applied in prediction to unseen data, e.g. (Jeff, Professor, 2) -> Tenured?]
“… the metaphor underlying genetic algorithms is that of natural evolution . In evolution, the problem each species faces is one of searching for beneficial adaptations to a complicated and changing environment . The ‘knowledge’ that each species has gained is embodied in the makeup of chromosomes of its members”
- L. Davis and M. Steenstrup, in "Genetic Algorithms and Simulated Annealing", pp. 1-11, Morgan Kaufmann, 1987
Genetic representation for potential solutions to the problem
A way to create an initial population of potential solutions
An evaluation function that plays the role of the environment, rating solutions in terms of their "fitness"
i.e. the use of fitness to determine survival and reproductive rates
Genetic operators that alter the composition of children
Evolutionary Algorithm Search Procedure
1. Randomly generate an initial population M(0)
2. Compute and save the fitness u(m) for each individual m in the current population M(t)
3. Define selection probabilities p(m) for each individual m in M(t) so that p(m) is proportional to u(m)
4. Generate M(t+1) by probabilistically selecting individuals to produce offspring via the genetic operators (crossover and mutation)
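The search procedure above can be sketched as a short genetic algorithm. The parameters (population size, crossover and mutation rates, generation count) and the one-max fitness function are illustrative assumptions, not from the slides:

```python
import random

# Minimal sketch of the evolutionary search procedure: fitness-proportional
# selection, one-point crossover, and bitwise mutation. All rates are
# illustrative defaults.

def evolve(fitness, n_bits=14, pop_size=20, generations=50,
           p_crossover=0.7, p_mutation=0.01):
    # 1. Randomly generate an initial population M(0)
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Compute the fitness u(m) of each individual m in M(t)
        u = [fitness(m) for m in pop]
        # 3. Selection probabilities p(m) proportional to u(m)
        weights = [ui / sum(u) for ui in u]
        next_pop = []
        while len(next_pop) < pop_size:
            # 4. Probabilistically select parents, then apply operators
            a, b = random.choices(pop, weights=weights, k=2)
            child = a[:]
            if random.random() < p_crossover:          # one-point crossover
                cut = random.randrange(1, n_bits)
                child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < p_mutation)  # bitwise mutation
                     for bit in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Example run: maximize the number of 1-bits ("one-max").
# The +1 keeps every fitness strictly positive for the selection weights.
best = evolve(fitness=lambda m: sum(m) + 1)
```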
Create a population (pop_size) of chromosomes, where each chromosome is a binary vector of 14 bits
All 14 bits for each chromosome are initialized randomly
Evaluation function
The evaluation function eval for a binary vector v is equal to the function f:
eval(v) = f(x)
e.g., eval(v1) = f(x1) = fitness1
Genotype -> phenotype mapping:

genotype          phenotype
00000000000000    0.0
00000000000001    4/(2^14 - 1)
...               ...
11111111111111    4.0

Population: chromosomes v1, v2, ..., v24
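The genotype-to-phenotype mapping in the table can be implemented as a linear decoding of the 14-bit vector onto the interval [0.0, 4.0] (the endpoints are taken from the table; the function name is ours):

```python
# Decode a binary genotype into a real-valued phenotype by mapping the
# integer value of the bit string linearly onto [lo, hi].

def decode(v, lo=0.0, hi=4.0):
    as_int = int("".join(map(str, v)), 2)
    return lo + as_int * (hi - lo) / (2 ** len(v) - 1)

print(decode([0] * 14))   # -> 0.0  (all-zeros genotype)
print(decode([1] * 14))   # -> 4.0  (all-ones genotype)
# decode([0]*13 + [1]) gives 4/(2**14 - 1), the second row of the table
```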
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Training examples:

Exs.  Surface  Color   Size    Class
1     Smooth   Yellow  Small   A
2     Smooth   Red     Medium  A
3     Smooth   Red     Medium  A
4     Rough    Red     Big     A
5     Smooth   Yellow  Medium  B
6     Smooth   Yellow  Medium  B

Induced decision tree: test color first; red -> A, yellow -> test size (small -> A, medium -> B). An unknown sample ("?") is classified by walking this tree from the root.
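Walking the induced tree for an unknown sample can be sketched directly; the tree structure is hard-coded from the example (color tested at the root, then size under the yellow branch):

```python
# Classify a sample by testing its attribute values against the decision
# tree from the example: color = red -> A; color = yellow -> test size
# (small -> A, medium -> B).

def classify(sample):
    if sample["color"] == "red":
        return "A"
    # color == yellow: descend to the size test
    return "A" if sample["size"] == "small" else "B"

# Classifying an unknown sample, e.g. (medium, yellow, smooth):
print(classify({"size": "medium", "color": "yellow", "surface": "smooth"}))  # -> B
```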
Question: what about two or more examples with the same description but different classifications?
Answer: each leaf node reports either the MAJORITY classification or the relative class frequencies
Question: what about irrelevant attributes (noise and overfitting)?
Answer: Tree pruning
Solution: an information gain close to zero is a good clue to irrelevance. Compare the actual numbers of positive and negative examples in each subset i, p_i and n_i, with the expected numbers p̂_i and n̂_i under the assumption of true irrelevance:

p̂_i = p × (p_i + n_i) / (p + n),   n̂_i = n × (p_i + n_i) / (p + n)

where p and n are the total numbers of positive and negative examples to start with.
Total deviation (measuring statistical significance):

D = Σ_i [ (p_i − p̂_i)² / p̂_i + (n_i − n̂_i)² / n̂_i ]

Under the null hypothesis (the attribute is truly irrelevant), D follows a chi-squared distribution
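The total deviation D can be computed directly from the per-subset counts. The example counts below are invented; a split that mirrors the overall positive/negative ratio exactly yields D = 0, the signature of an irrelevant attribute:

```python
# Compute the total deviation D: compare the actual (p_i, n_i) counts in
# each subset with the expected counts under the null hypothesis that the
# attribute is irrelevant.

def total_deviation(subsets):
    """subsets: list of (p_i, n_i) pairs, one per attribute value."""
    p = sum(pi for pi, _ in subsets)   # total positive examples
    n = sum(ni for _, ni in subsets)   # total negative examples
    D = 0.0
    for pi, ni in subsets:
        size = pi + ni
        p_hat = p * size / (p + n)     # expected positives if irrelevant
        n_hat = n * size / (p + n)     # expected negatives if irrelevant
        D += (pi - p_hat) ** 2 / p_hat + (ni - n_hat) ** 2 / n_hat
    return D

# A split that exactly mirrors the overall class ratio leaves D at zero:
print(total_deviation([(2, 2), (4, 4)]))  # -> 0.0
# A split that perfectly separates the classes deviates strongly:
print(total_deviation([(4, 0), (0, 4)]))  # -> 8.0
```

Large values of D (relative to the chi-squared distribution) indicate the attribute is genuinely informative; small values support pruning the branch.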