Barcelona sabatica

  1. Protein loop classification using Artificial Neural Networks
     Armando Vieira (1) and Baldomero Oliva (2)
     (1) ISEP and Centro de Física Computacional, Coimbra, Portugal, www.defi.isep.ipp.pt/~asv
     (2) Structural Bioinformatics Laboratory (GRIB), IMIM/Universitat Pompeu Fabra, Barcelona, Spain
  2. XXI: the century of BIO
  3. BIOINFORMATICS: joining two worlds apart
  4. Outline
     • Brief review of protein structure
     • Statement of the problem and why it is so hard
     • Data pre-processing, corrections, updates and beyond multiple alignments…
     • Neural Networks in protein structure prediction
     • HLVQ
     • Results and future work
  5. Proteins
     • All proteins are chains built from 20 amino acids
     • Not all chains of amino acids are proteins
     • Fold rapidly and repeatedly
     • Proteins are the machinery of life
     • Essential to all (known) organisms
  6. The Gist of it
     Amino acid sequence → Physical structure → Function
  7. Typical globular protein
     MEMEKKEFHIVAETGMEKKEFHIVAMEKKMEKIHARPATLLVQTASLFNSDINLETLGKSVNLKSIMGVMSLGVGQGSDVTITVDGADEADGMAAIVETLQLQGLAQ
  8. Coarse-Grained Model
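  9. Coarse-grained map of the (φ, ψ) plane
     Rows: ψ from +180 (top) to -180 (bottom). Columns: φ from -180 (left) to +180 (right).
     b b b p o M e e e
     b b b p o M M e e
     b b b p . l l s e
     a a a T . l l g N
     N a a a . U l g N
     N a a a . U g g N
     I a a a . G G G I
     e F F F o e e e e
     b b b p o e e e e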
  10. Ramachandran Alphabet
      [Figure: Ramachandran plot with φ on the horizontal axis and ψ on the vertical axis, both from -180° to 180°, with regions labelled B, A, G and E]
  11. 5-letter alphabet
      Residue sequence    3° structure
      MEKKEFHIVAET        ACCDECBAABDE
      GIHARPATLLVQT       CBDABCDBEABD
      ASLFNSDINLETL       BCBDBAEBDBDB
      GKSVNLKSIMGV        AEBABDCBBDBA
      MSLGVGQGSDVT        DDCBDBCBDBEB
      ITVDGADEADGM        DBCBBDCAABDE
      AAIVETLQLQGLA       DCDCEAABACAA
      Q...                AADC…
  12. What shall we do?
      • Ab initio: Quantum Mechanics + big computers + large # configurations = huge problems…
      • Machine Learning: use known cases to learn a suitable map: sequence → structure
  13. Machine Learning Approach
  14. Artificial Neural Networks
      • A problem-solving paradigm modeled after the physiological functioning of the human brain.
      • Neurons in the brain are modeled by computational nodes, and the synapses between them by weighted connections.
      • The firing of a neuron is modeled by input, output, and threshold functions.
      • The network "learns" from problems to which the answers are known (supervised learning).
      • The network can then produce answers to entirely new problems of the same type.
  15. Neural Networks
      [Figure: network diagram with an input layer, hidden layers and an output layer]
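The layered structure on slides 14 and 15 can be illustrated with a minimal forward pass. This is only a sketch under stated assumptions: the layer sizes, the random weights and the sigmoid threshold function are illustrative choices, not taken from the slides, and no training is shown.

```python
import math
import random


def sigmoid(x):
    """Smooth threshold function squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs plus a bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, network):
    """Propagate the inputs through every layer in turn and return the output layer."""
    activations = inputs
    for weights, biases in network:
        activations = layer(activations, weights, biases)
    return activations


def random_network(sizes, seed=0):
    """Build untrained layers with random weights (hypothetical helper for this sketch)."""
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1, 1) for _ in range(n_out)]
        network.append((weights, biases))
    return network


# Assumed sizes: 4 inputs, one hidden layer of 3 nodes, 2 outputs.
network = random_network([4, 3, 2])
outputs = forward([0.0, 1.0, 0.0, 1.0], network)
print(outputs)
```

Supervised learning would then adjust the weights and biases so that the outputs match the known answers.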
  16. Overfitting: high risk!
      A less complicated hypothesis can have a lower error rate on new data
  17. Hidden Layer Vector Quantization (HLVQ)
      [Figure: scatter plots of two classes of points (o and x), comparing a traditional NN with HLVQ]
      Main advantage: detects and corrects predictions for outliers
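The slide does not spell out how HLVQ works internally, so the following is not the authors' algorithm. It is a generic sketch of vector quantization (an LVQ1-style update followed by nearest-prototype classification) applied to vectors that stand in for hidden-layer activations. The prototypes, the 2-D activation vectors and the learning rate are all hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(prototypes, h):
    """Index of the prototype (code vector) closest to the activation vector h."""
    return min(range(len(prototypes)),
               key=lambda i: squared_distance(prototypes[i][0], h))


def lvq1_step(prototypes, h, label, rate=0.1):
    """LVQ1 update: pull the nearest prototype towards h if the labels agree,
    push it away otherwise."""
    i = nearest(prototypes, h)
    vector, proto_label = prototypes[i]
    sign = 1.0 if proto_label == label else -1.0
    prototypes[i] = ([p + sign * rate * (x - p) for p, x in zip(vector, h)],
                     proto_label)


def classify(prototypes, h):
    """Predict the label of the nearest prototype."""
    return prototypes[nearest(prototypes, h)][1]


# Hypothetical hidden-layer activations (2-D for readability) with known labels.
training = [([0.1, 0.3], "link"), ([0.9, 0.7], "hairpin"), ([0.3, 0.1], "link")]
prototypes = [([0.2, 0.2], "link"), ([0.8, 0.8], "hairpin")]

for _ in range(20):
    for h, label in training:
        lvq1_step(prototypes, h, label)

print(classify(prototypes, [0.15, 0.25]))
print(classify(prototypes, [0.85, 0.75]))
```

Classifying by distance to a prototype also gives a natural outlier signal: a point far from every prototype can be flagged rather than trusted.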
  18. Loops, loops everywhere!!!
  19. Look for a loop…
  20. Geometry of the Motif
  21. Loop Types
      • α-α: α-helix - α-helix
      • α-β: α-helix - β-strand
      • β-α: β-strand - α-helix
      • β-hairpin: β-strand - β-strand
      • β-link: β-strand - β-strand
  22. α-α
      Similar conformation: aa{b}aa / aa{p}aa
      Identical geometry: (4,6) (0,45) (45,90) (180,225)
      Pro 75%, Ser 75%
      1.3.1 aa{p}aa, 1.1.2 aa{b}aa
      © Baldomero Oliva
  23. Class α-α
  24. ArchDB database
      ~20 000 loops classified into ~3000 classes.
      Class label example: EE-3.4.1 (loop type - loop size . consensus . motif)
      TASK: classify a loop from sequence alone. If that is not possible, get as much information as possible.
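The class label layout on slide 24 (loop type, loop size, consensus, motif) can be split into its fields with a few lines of code. This is a sketch only: the `LoopClass` type and `parse_loop_class` function are hypothetical names, and it assumes the three fields after the hyphen are always integers, as in the single example on the slide.

```python
from typing import NamedTuple


class LoopClass(NamedTuple):
    """Fields of a class label such as 'EE-3.4.1' (hypothetical container)."""
    loop_type: str
    loop_size: int
    consensus: int
    motif: int


def parse_loop_class(label: str) -> LoopClass:
    """Split 'type-size.consensus.motif' into its four fields.

    Raises ValueError if the label does not have exactly three
    dot-separated integer fields after the first hyphen.
    """
    loop_type, _, rest = label.partition("-")
    size, consensus, motif = (int(part) for part in rest.split("."))
    return LoopClass(loop_type, size, consensus, motif)


print(parse_loop_class("EE-3.4.1"))
# LoopClass(loop_type='EE', loop_size=3, consensus=4, motif=1)
```

Because the label is hierarchical, a classifier that cannot resolve the full class can still report a prefix, such as the loop type and size only.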
  25. Problems
      • Coding of amino acids
      • Huge search space, sparsely populated
      • How to assign the loop classes?
      • High dimensionality → large networks → poor generalization
  26. Amino acid coding the classical way
      A → (1, 0, …, 0)
      C → (0, 1, …, 0)
      Y → (0, 0, …, 1)
      Useful but not efficient! I am working to improve it…
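The classical coding on slide 26 is a one-hot encoding: each residue becomes a 20-component vector with a single 1. The sketch below assumes the 20 one-letter codes are ordered alphabetically, which is consistent with the slide (A first, C second, Y last). The function names are illustrative.

```python
# The 20 standard amino acids by one-letter code, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return a 20-component vector with a 1 at the residue's position."""
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1
    return vector


def encode(sequence):
    """Concatenate the one-hot vectors of every residue in the sequence."""
    return [bit for residue in sequence for bit in one_hot(residue)]


loop = "GKSVNL"  # six residues taken from the sequence on slide 11
features = encode(loop)
print(len(features))  # 120 inputs for only 6 residues
print(sum(features))  # 6: exactly one non-zero entry per residue
```

The input size grows as 20 times the sequence length while almost every entry is zero, which is the inefficiency and the dimensionality problem raised on slides 25 and 26.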
  27. Theory; but how about applications?!
  28. β-β link and β-β hairpins from sequence
      Entries are HLVQ results, with MLP results in parentheses.

                          Predicted β-β link    Predicted β-β hairpin
      Real β-β link       88.4 (79.4)           11.6 (20.6)
      Real β-β hairpin    12.5 (16.1)           87.5 (83.9)
  29. Prediction of all loop types from sequence alone

                 β-β lk   α-β    β-β hp   β-α    α-α
      β-β lk     45.9     28.5   3.7      19.8   2.1
      α-β        8.8      67.4   1.2      18.0   4.6
      β-β hp     0.4      0.9    96.1     2.1    0.5
      β-α        4.4      6.2    2.4      79.5   7.6
      α-α        4.0      15.7   1.3      20.3   58.6
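The table on slide 29 can be read as a confusion matrix. The sketch below assumes, by analogy with slide 28, that rows are the real classes, columns are the predicted classes, and entries are row percentages (each row sums to roughly 100). It extracts the per-class hit rate from the diagonal and the most frequent wrong prediction for each class.

```python
CLASSES = ["β-β link", "α-β", "β-β hairpin", "β-α", "α-α"]

# Values copied from slide 29; rows assumed real, columns assumed predicted.
MATRIX = [
    [45.9, 28.5, 3.7, 19.8, 2.1],
    [8.8, 67.4, 1.2, 18.0, 4.6],
    [0.4, 0.9, 96.1, 2.1, 0.5],
    [4.4, 6.2, 2.4, 79.5, 7.6],
    [4.0, 15.7, 1.3, 20.3, 58.6],
]

# Diagonal entries: how often each real class is predicted correctly.
per_class = {name: MATRIX[i][i] for i, name in enumerate(CLASSES)}

# Unweighted mean over the five classes.
mean_rate = sum(per_class.values()) / len(per_class)


def top_confusion(i):
    """Most frequent wrong prediction for real class i, with its rate."""
    row = MATRIX[i]
    j = max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
    return CLASSES[j], row[j]


for i, name in enumerate(CLASSES):
    wrong, rate = top_confusion(i)
    print(f"{name}: {per_class[name]} correct, most often confused with {wrong} ({rate})")
print(round(mean_rate, 1))
```

Under these assumptions, β-β hairpins are recognized best (96.1) and β-β links worst (45.9), with links most often predicted as α-β.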
  30. What does it all mean?
      • Given a loop residue sequence, we can (usually) identify its native structure.
      • Not ab initio: we cannot tell the structure of a novel sequence.
      • HLVQ is superior to MLP
  31. Future Work
      • Better coding of amino acids
      • Larger sequences / low complexity
      • Going beyond structure
      • Clever alphabet that explores similarities
      • Multiobjective Genetic Algorithms
  32. Beyond Multiple Alignments
      • Alignments are good … but expensive and boring ...
      • Information contained in a multiple alignment can, in principle, be expressed using an adequate amino acid coding scheme
        Sensibility
      • How? Genetic Algorithm
  33. Coded Amino Acids
      Alanine (A), Arginine (R), Asparagine (N), Aspartic Acid (D), Cysteine (C),
      Glutamic Acid (E), Glutamine (Q), Glycine (G), Histidine (H), Isoleucine (I),
      Leucine (L), Lysine (K), Methionine (M), Phenylalanine (F), Proline (P),
      Serine (S), Threonine (T), Tryptophan (W), Tyrosine (Y), Valine (V)
      http://www.chemie.fu-berlin.de/chemistry/bio/
  34. ArchDB database
      • Protein Data Bank (PDB), http://www.rcsb.org, contains ~25 000 proteins with known structure, out of ~10^6 entries in SWISS-PROT
      • ArchDB: ~20 000 classified loops