Thesis def

Jay's thesis defense on database federation, data marts, and the power of integrated protein bioinformatics.

Published in: Technology

  • Data is the evidence of our measurements, essentially useless except for bookkeeping. Information is data that is meaningful: it has relationships and context. Knowledge is readily useful factual models and descriptions, like the model of CDC2’s role in the G2/M transition. The reason we do computational biology is that the vast number of proteins and networks in the cell cannot possibly be “held” in the mind of a human being: 20,000 proteins, easily 10,000 in a liver cell, with concentrations of up to 1 million per cell. Promiscuity and regulation, which determine cell fate and physiology, cannot be readily analyzed by any one person. Thus we have the “proteome”: the collection of relationships, sequences, and structures that allows us to classify and make general conclusions about the specific networks of protein-driven processes in the cell.
  • 1700s, Linnaeus: classification of life forms is more important than just tallying them up. 1980s, Doolittle: archiving proteins is not of value unless we classify them in a non-redundant manner that’s consistent with how proteins evolved, via duplication. Doolittle, interested in gene duplication and not a computer guy, built NEWAT as a new version of Margaret Dayhoff’s Atlas and encouraged people to use it for protein-centric sequence searching. He was able to find that different proteins in different organisms shared common features on a grand scale.
  • Where are we now? We now know that Doolittle was right: the human genome is highly modular, with one of the highest enrichments of multidomain proteins of any organism. Maybe by integrating information, we can transfer information between proteins more efficiently and effectively, thus decreasing the gap between sequence data and sequence knowledge.
  • Thesis def

    1. 1. Data Marts Integrate the Proteome
        Jay Vyas
    2. 2. The Information Content of the Proteome
        Knowledge
        Information
        Data
        1) cdc2+, cyclinB+, Mitosis
        2) cdc2-, Arrest
        3) cdc2 binds Importin alpha/beta.
        …
    3. 3. Evolution of a Relational Proteome
        [Timeline figure, 1965 to 2005. Labels: Insulin, Atlas, Needleman-Wunsch, Smith-Waterman, NEWAT, PDGF/v-sis, PDB, SCOP, SwissProt, NCBI, RefSeq, HGP, Protein Domains, …]
    4. 4. Data vs. Knowledge
        Data > Information
        Sequences
        Structures/Functions
        Sources:
        http://bytesizebio.net/
        http://www.dna.affrc.go.jp/growth/images/P-grwth-entrs.gif
        PLoS Comput Biol. 2006 Aug 25;2(8):e114. Epub 2006 Jul 14.
        Genome Res. 2008 March; 18(3): 449–461. doi: 10.1101/gr.6943508.
        http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=fold-scop
    5. 5. An Integrated Framework for Building Molecular Biological Data Marts
        Putting the model to use …
    6. 6. Data Marts: Targeted Integration
        Flat Data Repositories
        function
        structure
        sequence
        taxonomy
    7. 7. A Family of Data Driven Molecular Biology Tools
        Integration of structure calculation via NMR.
            - hybrid methods, iterative processing, reproducibility
            spectra, sequence, chemical shifts -> structure
        Automated detection of signaling/binding motifs in a candidate protein.
            protein sequence -> biological activity
        Filtration of “passenger” residues from specificity/functional residues on surfaces of protein structures.
            sequence + structure -> function
        “Multidimensional” Sequence Comparison
            sequence + taxonomy -> evolution
    8. 8. Sequence + Spectrum -> Structure
    9. 9. CONNJUR WB integrates format conversion, data inspection, and integrative processing …
        [Figure labels: CONNJUR-WB, RNMRTK, NMRPIPE]
        J Bio. NMR, 2011
    10. 10. Detection of functional subunits in proteins
        TREMBL-SwissProt
    11. 11. SwissProt vs UniProt vs TREMBL
    12. 12. Machine Learning
    13. 13. Spearmint (+)
    14. 14. (Nuclear) Bacterial Proteins
    15. 15. Xanthippe (-)
    16. 16. Snake proteins (can’t bind ATP)
    17. 17. Domain databases
        Bioinformatics. 2004 Aug 4;20 Suppl 1:i342-7.
        http://pir.georgetown.edu/pirwww/about/doc/tutorials/uniprot_struc.gif
        Bioinformatics (2001) 17 (10): 920-926
    18. 18. MinimotifMiner: a tool for predicting protein function via Short Sequence Peptide Motifs
        How can we increase the size of our functional motif database without increasing the number of false positives predicted?
        3000+ estimated motif publications / yr…
        Variant PubMed searches ‘x’:
        (("amino acid motif"[TIAB])) OR (("protein motif"[TIAB])) AND ("<x>"[PDAT] : "<x>"[PDAT])
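The PubMed search template on slide 18 can be filled in programmatically to estimate motif publications per year. The sketch below only builds the query strings; it makes no network calls. It assumes that `<x>` on the slide stands for a publication year, and the function names `motif_query` and `yearly_queries` are hypothetical. The OR terms are also grouped explicitly so the date filter applies to both, which is an assumption about the intended query rather than something the slide states.

```python
def motif_query(year: int) -> str:
    """Build a PubMed query string for motif papers published in one year.

    Assumption: the slide's "<x>" placeholder is a publication year.
    """
    # [TIAB] restricts a term to title/abstract; [PDAT] is the publication date.
    terms = '("amino acid motif"[TIAB]) OR ("protein motif"[TIAB])'
    # Assumption: group the OR terms so the date range applies to both of them.
    return f'({terms}) AND ("{year}"[PDAT] : "{year}"[PDAT])'


def yearly_queries(first_year: int, last_year: int) -> dict[int, str]:
    """Map each year in the inclusive range to its query string."""
    return {year: motif_query(year) for year in range(first_year, last_year + 1)}


if __name__ == "__main__":
    for year, query in yearly_queries(2008, 2010).items():
        print(year, query)
```

Each string can be pasted into the PubMed search box, and the hit counts compared across years.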
    19. 19. Relational Model of Functional Data: A Precise Model of Protein Functional Semantics
        BMC Genomics, 2009
    20. 20. NCBI_FEDERATED + Mimosa
        RMSD = 0.9
        BMC Genomics, 2009
    21. 21. A Peptide Annotation Pipeline
        BMC Bioinformatics 2010, 11:328
    22. 22. Further (GO) integration controls for the degenerate nature of motif searches
        [Figure labels: ~400, ~400, ~900]
        PLOS One, 2010
    23. 23. Short sequences are degenerate… Can they be merged with structural and evolutionary information?
        Chemistry & Biology, January 2000
        BMC Genomics, 2009
    24. 24. Venn: An Integrated Application for Database Driven Homology Threading of Protein Structures …
        Nucleic Acids Research, 2009
        Trends in Plant Science, 2010
    25. 25. VENN: "Twilight Zone" Sequence Homology Threading
        NAR, 2009
    26. 26. VENN-InterfaceMiner: How do different SH3 binding peptides functionally relate to one another?
        Left to right …
        1AZG (Human FYN) PRPLPVAP LYYGDWIPSNY
        1AVZ (Human FYN) TPQVPL YD … GDWPSNY
        1PRL (Chicken FYN) APPLPR YD ... WPNY (not shown)
        1H3H (Mouse GRB2) SRSTK ENPSWWTLPANY
    27. 27. Standard BLAST Searches
    28. 28. SSPEs reside in the “Twilight Zone”
        J. Bacteriology 2011
    29. 29. What happens when a sequence is inherently noisy?
        max 100-250; eval 10E-3 ...; word size 3-5; score matrix 80, 62, 30; gap? 0, 4; Q/N?
        manskysktdvqqvkrqnqqsasgqgqygtef
        gsetdaqqvrkqnqsaeqnkqqns
    30. 30. Sequence mining in 2D
    31. 31. Use a hypersensitive sequence search (+), and expand results into a 2nd dimension (-).
        Combine with taxonomical information to pinpoint a first estimate of the gene’s appearance.
        J. Bacteriology 2011
    32. 32. R3: A prototypical method for improved structure calculation.
    33. 33. R3: Convergence is generally improved by reseeding
    34. 34. Availability
        [Figure labels: Sequence, Structure; Sequence, Function; Structure; Sequence; Taxonomy; Function, Specificity; Sequence; Taxonomy, Evolution]
        www.connjur.org
        mnm.engr.uconn.edu
        venn.vcell.uchc.edu
        www.bio-toolkit.com
    35. 35. NCBI_FEDERATED + EXPERT SYSTEM
        RMSD = 0.9
        BMC Genomics, 2009
    36. 36. VENN: Fine grained analysis.
        Nuc. Acids Research, 2009
    37. 37. NCBI_FEDERATED: Taxonomy, Domain, Homologene & Refseq.
        Residue enrichment profiles.
    38. 38. VENN: Fine grained analysis of SH3 bound peptides reveals a similar interface for divergent sequences. Are the peptides similar to ?
        Left to right …
        1AZG (Human FYN) PRPLPVAP LYYGDWIPSNY
        1AVZ (Human FYN) TPQVPL YD … GDWPSNY
        1PRL (Chicken FYN) APPLPR YD ... WPNY
        1H3H (Mouse GRB2) SRSTK ENPSWWTLPANY
    39. 39. Solution: Use a hypersensitive sequence search, and expand results into a 2nd dimension.
        Combining this with taxonomical information pinpoints a first estimate of the gene’s appearance.
    40. 40. Gene Duplication, Domain Reuse, Functional Motifs, and Variance of Structural Specificity
        - "Twilight Zone" homologies
        - Structural Interfaces
        - Binding Specificity
        - Short Functional Motifs
        Vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.
        Nature 2001
    41. 41. Doolittle
        * Functional Protein Bioinformatics
            - CDD, MnM, Modular evolution of proteins
        * Database Normalization
            - "Archival" -> low S/N; unrepresentative
        * Protein-centric sequence searching
            - Rous Sarcoma Discovery (DNA, lost in translation)
        ***** All done before modern computing/database theory.
    42. 42. The Modern Age
        GenBank - archival
        NCBI / EBI - sequence data curation
        PDB / BMRB - structural data curation, deposition
        GO - functional annotations
        ...
    43. 43. What is data modelling?
        - Ambiguity vs. Vagueness
        - "Text" vs. "Syntax"
        - Biological Data: No clear "reference object".
            Solution: CONTEXT
    44. 44. Integration Strategies
        Database Federation
        Architectures
        Data Warehousing
        Data Marts
    45. 45. When To Federate?
        * New Genomes... Draft sequences.
        * Reproducibility is less important than insight.
    46. 46. Stark et al.
        Control of the G2/M Transition
        2006
    47. 47. Problem: There are hundreds of native peptides containing subsequences that are predicted to have SH3 binding properties. For example, [KR]..[KR] and P..P are known to interact with SH3 domains. But there is no method for comparing the structural binding mechanisms behind these variant peptides. Such a method is necessary because there are hundreds of SH3 domains in the human genome, with several diverse structures in the Protein Data Bank, which cannot be collectively analyzed by eye.
        Solution: Use the VENN program for homology titration to extract molecular interfaces from SH3 bound peptides.
        1) For each atom “a1” in each peptide chain of a structure:
              For each atom “a2” in a DIFFERENT chain of the same structure:
                 Is “a1” close to “a2”?
                 If yes, store a1, a2.
                 If no, keep going.
        2) Now, create a “synthetic structure” that extracts only the residues associated with atoms stored in step (1) and ignores covalent peptide bonds entirely. This structure represents a molecular interface, where all non-interacting residues are considered to be “extraneous noise”.
        3) To test the biological relevance of the molecular interface, apply it to varying species: Is the same signature generated from different structures?
        Conclusion:
        Although the W/P/N/Y residues in SH3 domains are far apart and variably spaced in sequence distance, they may have evolved to possess a common feature: conformance to a highly specific molecular interface.
        Mouse GRB2 and Human FYN are completely different domains, in different species, which bind different peptides. Yet surprisingly, their binding sites conform to the same interface.
        Venn is available at http://sbtools.uchc.edu/venn.
        Results
        Left to right …
        1AZG (Human FYN) PRPLPVAP binds LYYGDWIPSNY
        1AVZ (Human FYN) TPQVPL binds YD … GDWPSNY
        1PRL (Chicken FYN) APPLPR binds YD ... WPNY
        1H3H (Mouse GRB2) SRSTK binds ENPSWWTLPANY
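The contact search in steps 1 and 2 of slide 47 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not VENN's actual code: the `Atom` record, the `interface_residues` function, the 4.0 Å cutoff (the slide only says “close”), and the toy coordinates are all hypothetical and introduced here for the example.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    chain: str      # chain identifier within the structure
    resnum: int     # residue number
    resname: str    # three-letter residue name
    xyz: tuple      # (x, y, z) coordinates in angstroms


def interface_residues(atoms, peptide_chains, cutoff=4.0):
    """Return the residues that take part in a peptide/other-chain contact.

    Step 1: for each atom a1 in a peptide chain, check every atom a2 in a
    DIFFERENT chain and keep the pair when they are within `cutoff`.
    Step 2: reduce the stored atoms to their residues, ignoring covalent
    connectivity, which gives the "synthetic structure" of the interface.
    The 4.0 angstrom default cutoff is an assumption, not taken from the slide.
    """
    residues = set()
    for a1 in atoms:
        if a1.chain not in peptide_chains:
            continue
        for a2 in atoms:
            if a2.chain == a1.chain:
                continue
            if math.dist(a1.xyz, a2.xyz) <= cutoff:
                residues.add((a1.chain, a1.resnum, a1.resname))
                residues.add((a2.chain, a2.resnum, a2.resname))
    return sorted(residues)


# Hypothetical toy coordinates, not taken from any PDB entry.
toy_atoms = [
    Atom("P", 1, "PRO", (0.0, 0.0, 0.0)),   # peptide atom near the domain
    Atom("P", 2, "LEU", (10.0, 0.0, 0.0)),  # peptide atom far from the domain
    Atom("A", 5, "TRP", (3.0, 0.0, 0.0)),   # domain atom in contact with PRO 1
    Atom("A", 9, "TYR", (20.0, 0.0, 0.0)),  # domain atom with no contact
]

if __name__ == "__main__":
    print(interface_residues(toy_atoms, peptide_chains={"P"}))
```

The nested loop is quadratic in the number of atoms, which mirrors the slide's description; a spatial index would be the usual optimization for full PDB structures. Step 3 of the slide (comparing the extracted signature across species) is not covered here.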
    48. 48. Orthologous Homology Threading: Coarse Grained Function …
    49. 49. Do canonical binding motifs in proteins exhibit structural specificity when unbound?
        8000 distinct PDB chains (out of 35000 total structures).
        SH3 bound non-PXXPs
    50. 50. 1AZG PLPV 137
    51. 51. 1AZG PRPL 107
    52. 52. 1PRL PPLP 150
    53. 53. 1PRL PLPR 154
    54. 54. Non-SH3-complexed PXXPs
    55. 55. 2DJY PPPP 89
    56. 56. 1WA7 PGMP 111
    57. 57. Non-SH3-bound, non-PXXP
    58. 58. 2ORU PATG 817
        History
        Human Genome - 2001
        SCOP - 1994
        SwissProt/NCBI - 1986/1988
        NEWAT - 1981
        PDGF ~ v-sis - 1983
        Smith-Waterman - 1981
        PDB - 1973
        Needleman-Wunsch - 1970
        ATLAS - 1965
        Insulin Sequence - 1955
        Double Helix - 1953
