Lecture 9 slides: Machine Learning for Protein Structure Prediction

Slide 1: Introduction to Bioinformatics
  9. Machine Learning for Protein Structure Prediction #1
  Course 341, Department of Computing, Imperial College, London
  © Simon Colton
Slide 2: Remember the Scenario
  - We have found a gene in mice
    - Which, when active, makes them immune to a disease
  - The gene codes for a protein; the protein has a shape, and the shape dictates what it does
  - Humans share 96% of their genes with mice
  - So, what does the human protein look like?
Slide 3: The Database Approach
  - If two sequences are sequentially similar
    - Then they are very likely to code for similar proteins
  - Find the best match for the mouse gene
    - In terms of sequences
    - From a large database of individual human genes
    - Or from a database of families of genes
  - Infer protein structure from knowledge of matched genes
    - If lucky, a structure of one of them may already be known
Slide 4: There is another way…
  - Machine learning: a general set of techniques
    - For teaching a computer to make predictions
    - By observing given correct predictions (being trained)
  - Special type of prediction:
    - Classification of objects into classes
      - E.g., images into faces/cars/landscapes
      - E.g., drugs into toxic/non-toxic (binary classification)
  - We want to predict a protein's structure
    - Given its sequence
Slide 5: A Good Approach
  - Look at regions of a protein
    - i.e., contiguous stretches of residues
  - Define ways to describe the regions
    - So that we can infer the structure of a protein
      - From a description of all its regions
  - Learn methods for predicting:
    - What type of region a particular residue will be in
  - Apply this to a protein sequence
    - To find contiguous regions with the same description
    - Put regions together to predict the entire structure
Slide 6: For example
  Sequence:    G     A     G     D     G     A     N     A     A     A
  Prediction:  Alpha Alpha Alpha Alpha Inter Inter Beta  Beta  Beta  Beta
  (The trained predictor labels each residue; further processing groups
  the labels into an alpha helix region followed by a beta sheet region.)
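To make the "further processing" step concrete, here is a minimal Python sketch (the labels match the slide's example; the grouping logic is illustrative, not the course's actual method) that merges per-residue predictions into contiguous regions:

    from itertools import groupby

    # Toy per-residue predictions, matching the slide's example
    sequence = "GAGDGANAAA"
    labels = ["Alpha", "Alpha", "Alpha", "Alpha",
              "Inter", "Inter",
              "Beta", "Beta", "Beta", "Beta"]

    # Further processing: merge runs of identical labels into regions
    regions = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        regions.append((label, sequence[start:start + length]))
        start += length

    print(regions)
    # [('Alpha', 'GAGD'), ('Inter', 'GA'), ('Beta', 'NAAA')]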
Slide 7: Two Main Questions
  - How do we describe protein structures?
    - What are alpha helices and beta sheets?
    - Covered in the next lecture
  - How do we train our predictors?
    - Covered in this lecture (and the start of the next…)
Slide 8: Machine Learning in a Nutshell
  Examples in -> Predictor out
  - Learning is by example
    - More examples, better predictors
    - For some methods, the examples are used once
    - For other methods, they are used repeatedly
Slide 9: Machine Learning Considerations
  - What is the problem for the predictor to address?
  - What is the nature of our data?
  - How will we represent the predictor?
  - How will we train the predictor?
  - How will we test how good the predictor is?
Slide 10: Types of Learning Problems in Bioinformatics
  - Class membership
    - e.g., predictive toxicology
  - Prediction of sequences
    - e.g., sequences of protein sub-structures
  - Classification hierarchies
    - e.g., folds, families, super-families
  - Shape descriptions
    - e.g., binding site descriptions
  - Temporal models
    - e.g., activity of cells, metabolic pathways
Slide 11: Learning Data
  - Data comes in many forms, in particular:
    - Objects (to be classified/predicted for)
    - Classifications/predictions of objects
    - Features of objects (to use in the prediction)
  - Problems with data
    - Imprecise information (e.g., badly recorded data)
    - Irrelevant information (e.g., additional features)
    - Incorrect information (e.g., wrong classifications)
    - Missing classifications
    - Missing features for sets of objects
Slide 12: Types of Representations
  - Logical
    - Decision trees, grammars, logic programs
    - Symbolic, understandable representations
  - Probabilistic
    - Neural networks, Hidden Markov Models, SVMs
    - Mathematical functions, not easy to understand
  - Mixed
    - Bayesian Networks, Stochastic Logic Programs
    - Have the advantages of both, but are more difficult to work with
Slide 13: Advantages of Representations
  - Probabilistic models:
    - Can handle noisy and imprecise data
    - Useful when there is a notion of uncertainty (in data/hypothesis)
    - Well-founded (300 years of development)
    - Good statistical algorithms for estimation
  - Logical models:
    - Richness of description
    - Extensibility: probabilities, sequences, space, time, actions
    - Clarity of results
    - Well-founded (2400 years of development)
Slide 14: Decision Tree Representations
  - Input is a set of features
    - Describing an example/situation
  - Many "if-then" choices
    - Leaves are decisions
  - Logical representation:
    - "If-then" is implication
    - Branches are conjunctions
    - Different branches comprise
      - A disjunction
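As a minimal sketch (the features and classes are invented for illustration), a decision tree is just nested if-then choices whose leaves are the decisions:

    def classify(example):
        """A tiny hand-written decision tree for a toxic/non-toxic task:
        each branch is a conjunction of tests, each leaf a decision."""
        if example["molecular_weight"] > 500:     # first if-then choice
            if example["contains_halogen"]:       # second test on this branch
                return "toxic"       # leaf: weight > 500 AND halogen present
            return "non-toxic"       # leaf: weight > 500 AND no halogen
        return "non-toxic"           # leaf: weight <= 500

    print(classify({"molecular_weight": 620, "contains_halogen": True}))  # toxic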
Slide 15: Artificial Neural Networks
  - Layers of nodes
    - Input is transformed into numbers
    - Weighted averages are fed into nodes
  - High or low numbers come out of nodes
    - A threshold function determines whether high or low
  - Output nodes will "fire" or not
    - Determines the classification
      - For an example
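A minimal sketch of a single node, with invented numbers: the inputs are combined as a weighted sum, and a threshold function decides whether the node fires:

    def neuron(inputs, weights, threshold=0.5):
        # Weighted sum of the incoming numbers, then a hard threshold
        weighted_sum = sum(x * w for x, w in zip(inputs, weights))
        return 1 if weighted_sum > threshold else 0   # "fire" or not

    inputs = [0.9, 0.2, 0.4]         # input values (illustrative)
    weights = [0.6, 0.1, 0.3]        # learned connection weights (illustrative)
    print(neuron(inputs, weights))   # 1: the weighted sum 0.68 exceeds 0.5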
Slide 16: Logic Program Representations
  - Logic programs are a subset of first order logic
  - They consist of sets of Horn clauses
    - Horn clause:
      - A conjunction of literals implying a single literal
  - Can easily be interpreted
  - At the heart of the Prolog programming language
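For instance (a standard family-relations example, not from the slides), a Horn clause has the form

    parent(X, Y) \land parent(Y, Z) \rightarrow grandparent(X, Z)

which Prolog writes with the implied literal first: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).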
Slide 17: Learning Decision Trees
  - Problem: which feature do the nodes in the tree test?
    - And what happens for each case?
  - ID3 algorithm:
    - Uses a notion of "information gain"
    - Based on entropy: how (dis)organised the data is
    - Chooses the feature with the highest information gain
      - As the node to add to the tree next
    - Then restricts the examples for the next node
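A minimal sketch of the entropy and information-gain calculations that ID3 relies on (the labels and the split are invented for illustration):

    import math
    from collections import Counter

    def entropy(labels):
        """How disorganised the labels are: 0 = pure, 1 = even binary split."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(labels, splits):
        """Entropy before testing a feature minus the weighted entropy after."""
        total = len(labels)
        after = sum(len(s) / total * entropy(s) for s in splits)
        return entropy(labels) - after

    labels = ["T", "T", "T", "F", "F", "F"]
    # A feature whose test partitions the examples perfectly gains 1 bit:
    print(information_gain(labels, [["T", "T", "T"], ["F", "F", "F"]]))  # 1.0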
Slide 18: Learning Artificial Neural Networks
  - First problem: the layer structure
    - Usually settled through trial and error
  - Main problem: choosing the weights
    - A back-propagation algorithm is used to train them
  - Each example is given in turn
    - If it is currently correctly classified, that's fine
    - If not, the errors from the output are passed back
      - Propagated in order to change the weights throughout
      - Only very small changes are made (to avoid undoing good work)
  - Once all examples have been given
    - We start again, until some termination condition (e.g., accuracy) is met
    - Often requires thousands of such training 'epochs'
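A minimal single-node sketch of this training loop (a delta rule on one neuron rather than full multi-layer back-propagation, which passes errors back through the layers in the same spirit; the data is an invented toy problem):

    # Toy examples: inputs and target class (here, a logical OR)
    examples = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
    weights, bias = [0.0, 0.0], 0.0
    rate = 0.1   # only very small changes, to avoid undoing good work

    for epoch in range(1000):        # real networks may need thousands of epochs
        errors = 0
        for inputs, target in examples:
            fired = 1 if sum(x * w for x, w in zip(inputs, weights)) + bias > 0 else 0
            error = target - fired
            if error != 0:           # misclassified: pass the error back to the weights
                errors += 1
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
                bias += rate * error
        if errors == 0:              # termination condition: all examples correct
            break

    print(weights, bias)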
Slide 19: Learning Logic Programs
  - A notion of generality is used
    - One logic program is more general than another
      - If one follows from another (sort of) [subsumption]
  - A search space of logic programs is defined
    - Constrained using a language bias
  - Some programs search from general to specific
    - Using rules of deduction to go between sentences
  - Other programs search from specific to general
    - Using inverted rules of deduction
  - Search is guided by:
    - Performance of the LP with respect to classifying the training examples
    - Information-theoretic calculations to avoid over-specialisation
Slide 20: Testing Learned Predictors #1
  - It is imperative to test on unseen examples
    - We cannot report accuracy on examples that were used to train the predictor, because the results will be heavily biased
  - Simple method: hold back
    - When the number of examples > 200 (roughly)
    - Split into a training set and a test set (e.g., 80%/20%)
    - Never let the training algorithm see the test set
    - Report the accuracy of the predictor on the test set only
    - With smaller numbers, we have to worry about the statistics
  - This kind of testing came a little late to bioinformatics
    - Beware conclusions drawn about badly tested predictors
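A minimal sketch of the hold-back method (the data is invented; `predict` stands in for whatever trained predictor is being tested):

    import random

    # Toy labelled examples; any (object, classification) pairs would do
    examples = [(f"seq_{i}", random.choice(["Alpha", "Beta"])) for i in range(250)]

    random.shuffle(examples)          # avoid any ordering bias in the split
    cut = int(0.8 * len(examples))    # e.g., an 80%/20% split
    training_set = examples[:cut]     # the training algorithm sees only these
    test_set = examples[cut:]         # never shown to the training algorithm

    # Report accuracy on the test set only, e.g. for some trained `predict`:
    # accuracy = sum(predict(x) == y for x, y in test_set) / len(test_set)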
Slide 21: N-Fold Cross Validation
  - Leave one out
    - For m < 20 examples
    - Train on m-1 examples, test the predictor on the left-out example
    - Do this for every example and report the average accuracy
  - N-fold cross validation
    - Randomly split into n mutually exclusive sets (partitions)
    - For every set S:
      - Train using all examples from the other n-1 sets
      - Test the predictor on S, record the accuracy
    - Report the average accuracy over all the sets
    - 10-fold cross validation is common
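A minimal sketch of n-fold cross validation; `train` and `test_accuracy` are stand-ins for whichever learning method is being evaluated:

    import random

    def cross_validate(examples, train, test_accuracy, n=10):
        shuffled = examples[:]
        random.shuffle(shuffled)                    # randomise before splitting
        folds = [shuffled[i::n] for i in range(n)]  # n mutually exclusive sets
        accuracies = []
        for i, held_out in enumerate(folds):
            training = [x for j, fold in enumerate(folds) if j != i for x in fold]
            predictor = train(training)             # train on the other n-1 sets
            accuracies.append(test_accuracy(predictor, held_out))
        return sum(accuracies) / n                  # report the average accuracy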
Slide 22: Testing Learned Predictors #2
  - Often different considerations for different contexts
    - E.g., false positives/negatives in medical diagnosis
  - Confusion matrix (for binary prediction tasks):

                    Predicted F               Predicted T
      Actually F    number = a                number = b (false pos)
      Actually T    number = c (false neg)    number = d

  - Let t = a + b + c + d
  - Predictive accuracy = (a + d) / t
  - Majority class = max((a + b) / t, (c + d) / t)
  - Precision = Selectivity = d / (b + d)
  - Recall = Sensitivity = d / (c + d)
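A minimal sketch of these metrics with invented counts (a = true negatives, b = false positives, c = false negatives, d = true positives):

    a, b, c, d = 50, 5, 10, 35        # invented confusion-matrix counts
    t = a + b + c + d

    accuracy = (a + d) / t                           # 0.85
    majority_class = max((a + b) / t, (c + d) / t)   # 0.55, the baseline to beat
    precision = d / (b + d)    # 0.875: of those predicted T, how many really are
    recall = d / (c + d)       # ~0.778: of those actually T, how many were found
    print(accuracy, majority_class, precision, recall)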
Slide 23: Comparing Learning Methods
  - A very simple method: the majority class predictor
    - Predict everything to be in the majority class
    - Trained predictors must beat this to be credible
  - N-fold cross validation results are compared
    - To show an advance in predictor technology
  - However, accuracy is not the only consideration
    - Speed, memory and comprehensibility
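A minimal sketch of the majority class baseline, with invented labels:

    from collections import Counter

    labels = ["non-toxic"] * 80 + ["toxic"] * 20   # invented training labels
    majority, count = Counter(labels).most_common(1)[0]
    print(f"Predict everything as '{majority}': {count / len(labels):.0%} accuracy")
    # A trained predictor must beat this 80% to be credible.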
Slide 24: Overfitting
  - It's easy to over-train predictors
  - If a predictor is substantially better on the training set than on the test set, it is overfitting
    - Essentially, it has memorised aspects of the examples, rather than generalising properties of them
    - This is bad: think of a completely new example
    - Individual learning schemes have their own coping methods
  - An easy general approach to avoiding overfitting:
    - Maintain a validation set to perform tests on during training
      - When performance on the validation set degrades, stop learning
        - Be careful of blips in predictive accuracy (leave a while, then come back)
      - Note: this shouldn't be used as the testing set
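A minimal sketch of that early-stopping loop; `train_one_epoch` and `accuracy_on` are stand-ins for whichever learning scheme is in use:

    def train_with_early_stopping(predictor, train_one_epoch, accuracy_on,
                                  validation_set, patience=5, max_epochs=1000):
        best_accuracy, best_epoch = 0.0, 0
        for epoch in range(max_epochs):
            train_one_epoch(predictor)
            accuracy = accuracy_on(predictor, validation_set)
            if accuracy > best_accuracy:
                best_accuracy, best_epoch = accuracy, epoch
            elif epoch - best_epoch >= patience:
                # Degrading for several epochs, not just a blip: stop learning
                break
        return predictor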
