[Talk]

  1. 1. Machine Learning Algorithms for Protein Structure Prediction Jianlin Cheng Institute for Genomics and Bioinformatics School of Information and Computer Sciences University of California Irvine 2006
  2. 2. Outline I. Introduction II. 1D Prediction III. 2D Prediction (Beta-Sheet Topology) IV. 3D Prediction (Fold Recognition) V. Publications and Bioinformatics Tools
  3. 3. Importance of Protein Structure Prediction AGCWY…… Cell Sequence Structure Function
  4. 4. Four Levels of Protein Structure Primary Structure (a directional sequence of amino acids/residues) N C … Residue1 Residue2 Peptide bond Secondary Structure (helix, strand, coil) Alpha Helix Beta Strand / Sheet Coil
  5. 5. Four Levels of Protein Structure Tertiary Structure Quaternary Structure (complex) G Protein Complex
  6. 6. 1D: Secondary Structure Prediction. Sequence (MWLKKFGINLLIGQSV…) → Neural Networks + Alignments → per-residue states (CCCCHHHHHCCCSSSSS…): Helix, Strand, Coil. Accuracy: 78%. Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
  7. 7. 1D: Solvent Accessibility Prediction. Sequence (MWLKKFGINLLIGQSV…) → Neural Networks + Alignments → per-residue states (eeeeeeebbbbbbbbeeeebbb…): Exposed, Buried. Accuracy: 79%. Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
  8. 8. 1D: Disordered Region Prediction Using Neural Networks MWLKKFGINLLIGQSV… Disordered Region 1D-RNN OOOOODDDDOOOOO… 93% TP at 5% FP Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005
  9. 9. 1D: Protein Domain Prediction Using Neural Networks MWLKKFGINLLIGQSV… Boundary + SS and SA 1D-RNN NNNNNNNBBBBBNNNN… HIV capsid protein Inference/Cut Domain 1 Domain 2 Domains Top ab-initio domain predictor in CAFASP4 Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.
  10. 10. 1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine Correlation = 0.76 Support …MWLAVFILINLK… Vector Machine • First method to predict energy changes from sequence accurately • Useful for protein engineering, protein design, and mutagenesis analysis Cheng, Randall, and Baldi. Proteins, 2006
  11. 11. 2D: Contact Map Prediction. 3D Structure → 2D Contact Map (n × n matrix indexed by residues i and j, each running from 1 to n). Distance Threshold = 8 Å. Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
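
The contact-map definition on slide 11 can be illustrated with a minimal sketch. It assumes one representative 3D coordinate per residue (the slide does not say which atom is used) and treats two residues as in contact when their distance is below the 8 Å threshold. The function name `contact_map` and the toy coordinates are hypothetical.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix; entry (i, j) is 1 when residues i and j
    are closer than `threshold` angstroms.

    coords: list of (x, y, z) tuples, one per residue (assumed representation).
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]

# Toy example: two neighbouring residues and one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The matrix is symmetric and its diagonal is trivially 1, so in practice only the entries with i < j carry information.
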
  12. 12. 2D: Disulfide Bond Prediction Cysteine i Support yes 2D-RNN Vector Machine Disulfide Bond Graph Cysteine j Matching [1] Baldi, Cheng, Vullo. NIPS, 2004. [2] Cheng, Saigo, Baldi. Proteins, 2005
  13. 13. 2D: Prediction of Beta-Sheet Topology N terminus Beta Sheet • Ab-Initio Structure Prediction • Fold Recognition Beta Strand • Protein Design • Protein Folding Cheng and Baldi, Bioinformatics, 2005 C terminus Beta Residue Pair
  14. 14. An Example of Beta-Sheet Topology Level 1 4 5 2 1 3 6 7 Structure of Beta Sheets Protein 1VJG
  15. 15. An Example of Beta-Sheet Topology Level 1 Level 2 4 5 Antiparallel 2 1 3 6 7 Parallel Structure of Beta Sheets Strand Protein 1VJG Strand Pair Strand Alignment Pairing Direction
  16. 16. An Example of Beta-Sheet Topology Level 1 Level 2 Level 3 4 5 Antiparallel H-bond 2 1 3 6 7 Parallel Structure of Beta Sheets Strand Beta Residue Protein 1VJG Strand Pair Residue Pair Strand Alignment Pairing Direction
  17. 17. Three-Stage Prediction of Beta-Sheets • Stage 1 Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003) • Stage 2 Use beta-residue pairing probabilities to align beta-strands • Stage 3 Predict beta-strand pairs and beta-sheet topology using graph algorithms
  18. 18. Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks. Input Matrix I (m×m), with a feature vector Iij for each residue pair (i, j) → 2D-RNN, O = f(I) → Output Matrix O (m×m), Oij: pairing probability. Target Matrix T (m×m), Tij: 0/1. Example sequence: …AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK…. Inputs per residue: 20 for residues, 3 for SS, 2 for SA
  19. 19. An Example (Target) 1 2 3 45 6 7 Protein 1VJG Beta-Residue Pairing Map (Target Matrix)
  20. 20. An Example (Target) 1 2 3 45 6 7 Antiparallel Parallel Protein 1VJG Beta-Residue Pairing Map (Target Matrix)
  21. 21. An Example (Prediction)
  22. 22. Stage 2: Beta-Strand Alignment • Use output probability matrix as scoring matrix • Dynamic programming • Disallow gaps and use the simplified search algorithm • Two pairing directions, antiparallel and parallel, for strands of lengths m and n • Total number of alignments = 2(m+n-1)
  23. 23. Strand Alignment and Pairing Matrix • The alignment score is the sum of the pairing probabilities of the aligned residues • The best alignment is the alignment with the maximum score • Strand Pairing Matrix Strand Pairing Matrix of 1VJG
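
The gapless alignment search on slides 22 and 23 can be sketched as a plain enumeration. This is an illustration under stated assumptions, not the BETApro implementation: `P` is a hypothetical m × n block of pairing probabilities between the residues of two strands, every relative shift is tried in both directions, and each alignment is scored by the sum of the probabilities of its aligned residue pairs.

```python
def best_gapless_alignment(P):
    """Enumerate all gapless alignments of two strands and return the best.

    P: m x n list of lists; P[i][j] is the pairing probability between
       residue i of the first strand and residue j of the second.
    Returns (best, count) where best = (score, direction, offset, pairs).
    """
    m, n = len(P), len(P[0])
    candidates = []
    for direction in ("parallel", "antiparallel"):
        # offsets -(n-1) .. (m-1): m + n - 1 relative shifts per direction
        for offset in range(-(n - 1), m):
            pairs = []
            for i in range(m):
                j = i - offset
                if 0 <= j < n:
                    # antiparallel: walk the second strand in reverse
                    col = j if direction == "parallel" else n - 1 - j
                    pairs.append((i, col))
            score = sum(P[i][c] for i, c in pairs)
            candidates.append((score, direction, offset, pairs))
    best = max(candidates, key=lambda c: c[0])
    return best, len(candidates)

# Toy 3 x 2 block whose high values lie on an antiparallel diagonal.
toy_P = [[0.1, 0.9],
         [0.8, 0.1],
         [0.1, 0.1]]
(best_score, best_direction, best_offset, best_pairs), n_alignments = \
    best_gapless_alignment(toy_P)
```

With gaps disallowed, the search space is exactly the 2(m+n-1) alignments counted on the slide, so exhaustive enumeration is cheap.
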
  24. 24. Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Topology (a) Seven strands of protein 1VJG in sequence order (b) Beta-sheet topology of protein 1VJG
  25. 25. Minimum Spanning Tree Like Algorithm Strand Pairing Graph (SPG) (a) Complete SPG Strand Pairing Matrix
  26. 26. Minimum Spanning Tree Like Algorithm Strand Pairing Graph (SPG) (a) Complete SPG (b) True Weighted SPG Strand Pairing Matrix Goal: Find a set of connected subgraphs that maximize the sum of the alignment scores and satisfy the constraints Algorithm: Minimum Spanning Tree Like Algorithm
  27. 27. An Example of MST Like Algorithm. Strand Pairing Matrix of 1VJG (lower triangle, strands 1 to 7):
             1     2     3     4     5     6     7
        1    0
        2    1.3   0
        3    .94   .37   0
        4    .02   .02   .04   0
        5    .02   .02   .03   1.9   0
        6    .10   .05   .74   .04   .04   0
        7    .02   .02   .03   .02   .02   .20   0
      Step 1: Pair strand 4 and 5
  28. 28. An Example of MST Like Algorithm (same Strand Pairing Matrix of 1VJG). Step 2: Pair strand 1 and 2
  29. 29. An Example of MST Like Algorithm (same Strand Pairing Matrix of 1VJG). Step 3: Pair strand 1 and 3
  30. 30. An Example of MST Like Algorithm (same Strand Pairing Matrix of 1VJG). Step 4: Pair strand 3 and 6
  31. 31. An Example of MST Like Algorithm (same Strand Pairing Matrix of 1VJG). Step 5: Pair strand 6 and 7
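
The five steps on slides 27 to 31 are consistent with a greedy, Kruskal-style selection over the strand pairing matrix. The sketch below is an assumption-laden illustration: the slides say only that the chosen pairs must "satisfy the constraints", so the two rules used here (at most two partners per strand, stop once every strand has a partner) are guesses that happen to reproduce the order shown. The function name `greedy_strand_pairing` is hypothetical.

```python
def greedy_strand_pairing(scores, max_partners=2):
    """Pick strand pairs greedily by decreasing alignment score.

    scores: dict {(i, j): score} with i < j, strands numbered from 1.
    Assumed constraints (not stated on the slides): each strand has at most
    `max_partners` partners; stop when every strand has at least one.
    """
    strands = sorted({s for pair in scores for s in pair})
    partners = {s: set() for s in strands}
    chosen = []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), _score in ranked:
        if all(partners[s] for s in strands):
            break  # every strand is already paired
        if len(partners[i]) >= max_partners or len(partners[j]) >= max_partners:
            continue  # adding this pair would violate the partner limit
        partners[i].add(j)
        partners[j].add(i)
        chosen.append((i, j))
    return chosen

# Lower triangle of the strand pairing matrix of 1VJG from the slides:
# key = row strand, values = scores against strands 1, 2, ... in order.
rows_1vjg = {
    2: [1.3],
    3: [.94, .37],
    4: [.02, .02, .04],
    5: [.02, .02, .03, 1.9],
    6: [.10, .05, .74, .04, .04],
    7: [.02, .02, .03, .02, .02, .20],
}
scores_1vjg = {(col + 1, row): w
               for row, ws in rows_1vjg.items()
               for col, w in enumerate(ws)}
pairs_1vjg = greedy_strand_pairing(scores_1vjg)
```

Under these assumptions the pair (2, 3) with score .37 is skipped because strand 3 already has two partners, which is why the selection jumps from (3, 6) to (6, 7).
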
  32. 32. 1. Beta Residue Pairing
        Method                                Specificity/Sensitivity   Ratio of Improvement
        BetaPairing                           41%                       17.8
        CMAPpro (Pollastri and Baldi, 2002)   27%                       11.7
      2. Beta Strand Alignment
        Method                                             Alignment Accuracy   Pairing Direction
        BetaPairing                                        66%                  84%
        Statistical Potential (Hubbard, 1994)              40%                  X
        Pseudo-energy (Zhu and Braun, 1999)                35%                  X
        Information Theory (Steward and Thornton, 2002)    37%                  X
      3. Beta Strand Pairing
        Method      Specificity   Sensitivity   % of non-local pairs
        MST Like    53%           59%           20%
  33. 33. 3D Structure Prediction (MWLKKFGINLLIGQSV…) • Ab-Initio Structure Prediction: simulation with a physical force field (protein folding), or reconstruction from a contact map; select structure with minimum free energy • Template-Based Structure Prediction: query protein (MWLKKFGINKH…) → Fold Recognition against templates from the Protein Data Bank → Alignment
  34. 34. A Machine Learning Information Retrieval Framework for Fold Recognition Fold Recognition Cheng and Baldi, Bioinformatics, 2006 Query Protein Alignment MWLKKFGIN…… Template Protein Data Bank Machine Learning Ranking
  35. 35. Classic Fold Recognition Approaches Sequence - Sequence Alignment (Needleman and Wunsch, 1970. Smith and Waterman, 1981) Query ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL Template ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL Alignment (similarity) score Works for >40% sequence identity (Close homologs in protein family)
  36. 36. Classic Fold Recognition Approaches Profile - Sequence Alignment (Altschul et al., 1997) ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL Query ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL Family ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL Average Score Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN More sensitive for distant homologs in superfamily. (> 25% identity)
  37. 37. Classic Fold Recognition Approaches Profile - Sequence Alignment (Altschul et al., 1997) 12………………………………….………………n 1 2 … n ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL A 0.4 Query ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL C 0.1 Family ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL … ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL W 0.5 Position Specific Scoring Matrix Or Hidden Markov Model Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN More sensitive for distant homologs in superfamily. (> 25% identity)
  38. 38. Classic Fold Recognition Approaches Profile - Profile Alignment (Rychlewski et al., 2000) 1 2 … n ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL A 0.1 Query ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL C 0.4 Family ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL … ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL W 0.5 1 2 … m Template ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN A 0.3 IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN C 0.5 Family IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM … W 0.2 More sensitive for very distant homologs. (> 15% identity)
  39. 39. Classic Fold Recognition Approaches Sequence - Structure Alignment (Threading) (Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994) Query Fit Fitness MWLKKFGINLLIGQS…. Score Template Structure Useful for recognizing similar folds without sequence similarity. (no evolutionary relationship)
  40. 40. Integration of Complementary Approaches. Query → Internet → FR Server 1, FR Server 2, FR Server 3 → Meta Server → Consensus (Lundstrom et al., 2001. Fischer, 2003) 1. Reliability depends on availability of external servers 2. Makes decisions on a handful of candidates
  41. 41. Machine Learning Classification Approach Support Vector Machine (SVM) Class 1 Proteins Class 2 Class m Classify individual proteins to several or dozens of structure classes (Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004) Problem 1: can’t scale up to thousands of protein classes Problem 2: doesn’t provide templates for structure modeling
  42. 42. Machine Learning Information Retrieval Framework. Query-Template Pair → Relevance Function (e.g., SVM) → Score 1, Score 2, …, Score n (+ relevant, - not relevant) → Rank • Extract pairwise features • Comparison of two pairs (four proteins) • Relevant or not (one score) vs. many classes • Ranking of templates (retrieval)
  43. 43. Pairwise Feature Extraction • Sequence / Family Information Features: Cosine, correlation, and Gaussian kernel • Sequence – Sequence Alignment Features: Palign, ClustalW • Sequence – Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST • Profile – Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM • Structural Features: Secondary structure, solvent accessibility, contact map, beta-sheet topology
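
The retrieval step on slide 42 amounts to sorting templates by their relevance score for a given query. A minimal sketch follows, assuming the scores have already been produced by some relevance function. The template identifiers, the score values, and the helper names `rank_templates` and `hit_in_top_k` are all hypothetical.

```python
def rank_templates(scores):
    """Return template ids ordered from most to least relevant.

    scores: dict mapping template id -> relevance score for one query.
    """
    return sorted(scores, key=scores.get, reverse=True)

def hit_in_top_k(ranked, relevant, k=5):
    """True when at least one relevant template appears in the top k."""
    return any(t in relevant for t in ranked[:k])

# Toy scores for one query against four hypothetical templates.
toy_scores = {"tmplA": -0.7, "tmplB": 1.2, "tmplC": 0.3, "tmplD": -1.5}
toy_ranking = rank_templates(toy_scores)
```

Because each query-template pair receives a single score, the same function ranks any number of templates, which is the contrast with the many-class classifiers on the previous slide.
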
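
The three sequence / family similarity measures named on slide 43 (cosine, correlation, Gaussian kernel) can be sketched for a pair of numeric vectors. What the vectors encode is an assumption here (for example, residue composition of the query and of the template); the slide does not specify it, and the `sigma` width parameter is likewise illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def correlation(u, v):
    """Pearson correlation: cosine of the mean-centred vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def gaussian_kernel(u, v, sigma=1.0):
    """exp(-||u - v||^2 / (2 * sigma^2)); sigma is an assumed width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def pair_features(u, v):
    """Three similarity features for one query-template pair."""
    return [cosine(u, v), correlation(u, v), gaussian_kernel(u, v)]
```

Each function returns one number, so a query-template pair contributes a short fixed-length feature vector regardless of the lengths of the two proteins.
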
  44. 44. Pairwise Feature Extraction
  45. 45. Relevance Function: Support Vector Machine Learning. Training Data Set in Feature Space: Positive Pairs (Same Folds) and Negative Pairs (Different Folds) → Support Vector Machine Training/Learning → Hyperplane
  46. 46. Relevance Function: Support Vector Machine Learning (1) (2) Margin Margin f(x) = K is Gaussian Kernel:
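
The formulas on slide 46 are not present in the transcript. For orientation only, the standard textbook form of an SVM decision function with a Gaussian kernel is given below. The symbols (support-vector weights \alpha_i, labels y_i, bias b, kernel width parameter \gamma) are conventional notation and an assumption about what the slide showed, including the way the kernel width is parameterized.

```latex
f(\mathbf{x}) = \sum_{i} \alpha_i \, y_i \, K(\mathbf{x}, \mathbf{x}_i) + b,
\qquad
K(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\gamma \, \lVert \mathbf{x} - \mathbf{y} \rVert^{2}\right)
```

Here \mathbf{x} is the feature vector of a query-template pair, the sum runs over the training pairs \mathbf{x}_i with labels y_i \in \{+1, -1\}, and the value f(\mathbf{x}) serves as the relevance score used for ranking.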
  47. 47. Training and Cross-Validation • Standard benchmark (Lindahl’s dataset, 976 proteins) • 976 x 975 query-template pairs (about 7,468 positives) • Each of the 976 queries has 975 query-template pairs • Train / Learn on 90% of the queries (1 to 878) • Test on the remaining 10% (879 to 976): rank the 975 templates for each query
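
The split on slide 47 is made by query, so all 975 pairs of a given query land on the same side. A small sketch of one such fold, using the counts from the slide, is shown below. The function names and the integer protein identifiers are hypothetical, and how the remaining folds of the cross-validation are formed is not shown on the slide.

```python
def split_queries(n_queries=976, train_fraction=0.9):
    """Split query ids 1..n_queries into a leading train block and a test block."""
    n_train = int(n_queries * train_fraction)
    train = list(range(1, n_train + 1))
    test = list(range(n_train + 1, n_queries + 1))
    return train, test

def pairs_for(query, n_proteins=976):
    """All query-template pairs for one query (every other protein is a template)."""
    return [(query, t) for t in range(1, n_proteins + 1) if t != query]

train_queries, test_queries = split_queries()
n_train_pairs = sum(len(pairs_for(q)) for q in train_queries)
n_test_pairs = sum(len(pairs_for(q)) for q in test_queries)
```

Splitting by query rather than by pair keeps every test query entirely unseen during training, which matches the retrieval setting where templates are ranked for a new query.
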
  48. 48. Results for Top Five Ranked Templates
        Method      Family   Superfamily   Fold
        PSI-BLAST   72.3     27.9          4.7
        HMMER       73.5     31.3          14.6
        SAM-T98     75.4     38.9          18.7
        BLASTLINK   78.9     40.6          16.5
        SSEARCH     75.5     32.5          15.6
        SSHMM       71.7     31.6          24
        THREADER    58.9     24.7          37.7
        FUGUE       85.8     53.2          26.8
        RAPTOR      77.8     50            45.1
        SPARKS3     86.8     67.7          47.4
        FOLDpro     89.9     70.0          48.3
      • Family: close homologs, more identity • Superfamily: distant homologs, less identity • Fold: no evolutionary relation, no identity
  49. 49. Specificity-Sensitivity Plot (Family)
  50. 50. Specificity-Sensitivity Plot (Superfamily)
  51. 51. Specificity-Sensitivity Plot (Fold)
  52. 52. Advantages of MLIR Framework • Integration • Accuracy • Extensibility • Simplicity • Reliability • Completeness • Potentials Disadvantages Slower than some alignment methods
  53. 53. A CASP7 Example: T0290. Query sequence (173 residues): RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFM VQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVV FGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP → FOLDpro → Predicted Structure. Compared with the experimental structure: RMSD = 1 Å
  54. 54. Publications and Bioinformatics Tools 1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond Connectivity. NIPS 2004. [DIpro 1.0] 2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins, 2006. [DIpro 2.0] 3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005. [BETApro] 4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, 2005. [SSpro 4/ACCpro 4/CMAPpro 2] 5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, 2005. [DISpro]
  55. 55. Publications and Bioinformatics Tools 6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a Generative, Scalable, Software Infrastructure for Pathway Bioinformatics and Systems Biology. IEEE Intelligent Systems, 2005. [Sigmoid] 7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single Site Mutations Using Support Vector Machines. Proteins, 2006. [MUpro] 8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J. Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H. Lathrop. Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology and Bioinformatics, 2006. 9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006. [DOMpro] 10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, 2006. [FOLDpro]
  56. 56. Acknowledgements • Pierre Baldi • G. Wesley Hatfield, Eric Mjolsness, Hal Stern, Dennis Decoste, Suzanne Sandmeyer, Richard Lathrop, Gianluca Pollastri, Chin-Rang Yang • Mike Sweredoski, Arlo Randall, Liza Larsen, Sam Danziger, Trent Su, Hiroto Saigo, Alessandro Vullo, Lucas Scharenbroich
  57. 57. Markov Models
  58. 58. 1D-Recursive Neural Network
  59. 59. 2D-Recursive Neural Network
  60. 60. 2D-RNNs
  61. 61. 2D RNNs
