Protein function and bioinformatics


Talk for the BIOC6007 course at UQ; a lot of the material is similar to the presentation on genomics of cold-adapted microorganisms.

Published in: Technology, Education

  1. Protein function and bioinformatics
     Outline of talk
     ● Why do we need bioinformatics?
     ● What tools do we need?
     ● Case study: The Methanococcoides burtonii genome
     Neil Saunders 76-455
  2. Why do we need bioinformatics?
     ● Rapid increase in data due to genomics
     ● Too much data to characterise genes/proteins individually
     ● Bioinformatics = “smart use” of information
     ● Ideally, computational and experimental biology are partners
  3. The ideal computational – wet lab cycle
     [Diagram: biological system, biological objects, experiments, computational objects, biological inferences, analyses]
     Bioinformatics is about helping biologists solve problems
  4. Introduction to genomics
     ● Genomes Online database
         Published/complete      413
         Bacteria in progress    977
         Eukarya in progress     629
         Archaea in progress      57
         Metagenomes              56
     10-50% of genes in a new genome may have no known function
  5. Computational skills for genomics
     "So what new skills will postdocs need to ensure that they don't become science relics? The answer is math, statistics, and knowledge of a scripting language for computers."
     The Scientist, "Bioinformatics Knowledge Vital to Careers", Volume 16 | Issue 17 | 53 | Sep. 2, 2002, www.the-
  6. Using WWW resources
     ● The best web resources provide:
       - useful tools for analysis
       - integrated data from many sources
     Good examples
     ● InterPro database
     ● Expasy
     ● UniProt
     ● CBS Prediction servers
     ● IMG Database
     But...
     ● Web services are no good for genome-scale analyses
     ● There are usually limits on data input (with good reason)
     ● Nucleic Acids Research publishes annual database and web server editions
  7. Computational infrastructure for genomics
     [Diagram] Biological objects / computational objects: genome, gene sequence, protein sequence, protein structure, pathway
     [Diagram] Analysis (limitless): sequence analysis, assembly, regulatory motifs, structural modeling, phylogeny, comparative genomics, pathway reconstruction
     Key points
     ● Appropriate hardware: workstation v. cluster
     ● Linux Linux Linux!
     ● Freely-available, open source software is all you need
     ● Toolkits and libraries (e.g. BioPerl) to build your own solutions
     ● Philosophy of “many small tools plus glue” - scripting language
     ● Website + database skills - sharing
  8. BioPerl: a life sciences computational toolkit
     ● A collection of Perl modules for biology
     ● Handles many common tasks in sequence/structure analysis, e.g.
       - read/write various sequence formats
       - run BLAST and parse the output
       - read/write/analyse sequence alignments
       - access local or remote databases
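Slide 8 lists reading sequence formats as a typical toolkit task. As a rough illustration only, and written in Python rather than the Perl/BioPerl used in the talk, a minimal FASTA reader might look like the sketch below. The function name `read_fasta` and the two toy records are hypothetical; none of this is BioPerl's API.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {identifier: sequence} dict."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the current record, upper-cased.
            records[current_id] += line.upper()
    return records


# Toy input (made-up sequences, for illustration only).
example = """>seq1 hypothetical protein
MKTAYIAK
QRQISFVK
>seq2
mstnpkpq
""".splitlines()

records = read_fasta(example)
```

A real toolkit also handles other formats, validation and very large files; this sketch only shows the shape of the task.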
  9. Annotation (or not) using BLAST
     BLAST: Basic Local Alignment Search Tool
     ● Is useful for finding similar sequences quickly
     ● Not sensitive – less useful for weakly-similar sequences
     ● Not much good at all for annotation
     Why not?
     ● “Hypothetical”: the database sequence is unique
     ● “Conserved hypothetical”: several hits but no known function
     ● Multi-domain proteins
     ● BLAST database contains incorrect annotations
     ● Annotation is at the whim of whoever deposited the sequence
     Classic example: IMPDH
     Wu et al. (2003) Comp. Biol. Chem. 27: 37-47
  10. A better annotation tool: InterProScan
      ● IPRScan is a tool to search the InterPro database
      ● It uses sequence signature profiles – more sensitive than BLAST
      ● Integrates the search results from multiple databases
      ● A good first step to characterise a new sequence
      ● Available as a standalone package that runs on clusters
  11. Structure prediction: threading and modelling
      ● The structure of a protein often explains how it functions
      ● However, structural determination is laborious, difficult and time-consuming
      ● Modelling can be useful in cases where the sequence is similar to that of a known structure
      Threading: fit query sequence to fold database
      Homology modelling: assume similar sequence = similar structure
  12. Some modelling tools and databases
      ● SwissModel
      ● MODELLER
      ● PROSPECT
      ● ModBase
  13. Introduction to M. burtonii
      [Images: M. burtonii; Ace Lake, Vestfold Hills; The Archaea]
      Methanococcoides burtonii
      ● Isolated from Ace Lake, Antarctica (1-2 °C)
      ● Grows optimally at 23 °C
      ● Is an archaeon
      ● Is a psychrophilic methanogen
  14. The M. burtonii genome
      What features of this genome are related to cold adaptation?
  15. Discovery of CSP-like proteins in M. burtonii
      ● CSP = cold shock protein
      ● Expressed in bacteria at low temperature
      ● Functions as an RNA chaperone to facilitate transcription at low temperature
      ● Present in some Archaea, including M. frigidum, but not M. burtonii
  16. Discovery of CSP-like proteins in M. burtonii
      [Diagram: protein sequences, PROSPECT thread v. CSD folds, MODELLER, structural model; structures labelled d1sro__ and M. burtonii YP_564958]
      ● Both proteins are expressed (proteomics)
      ● Located in a putative exosome/proteasome superoperon
      ● This is consistent with their proposed function
  17. Integrating information: structural RNA study
      [Plot: % GC v. OGT (°C); series: stems, all bases]
      Is tRNA GC content related to OGT?
      ● tRNAScan: find tRNA in genomes
      ● GC content calculated using Perl scripts
      Dihydrouridine in M. burtonii
      ● tRNA contains > 1 hU/tRNA
      ● Maintains flexibility at low temperature
      ● DUS gene identified using iprscan
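Slide 17 says that tRNA GC content was calculated with Perl scripts, which are not shown. The following is only a sketch of that kind of calculation, in Python rather than Perl; the function name `gc_content` and the choice to ignore ambiguous bases are assumptions, not the talk's actual script.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the unambiguous bases of seq.

    Accepts DNA (T) or RNA (U). Characters other than A, C, G, T, U
    (for example N or gap symbols) are ignored. Returns 0.0 when no
    unambiguous bases are present.
    """
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


# Made-up 20-base RNA fragment, for illustration only.
fragment = "GCGGAUUUAGCUCAGUUGGG"
fragment_gc = gc_content(fragment)
```

Computing the stem-only series shown in the plot would additionally need the stem positions of each tRNA from a secondary-structure annotation, which this sketch does not attempt.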
  18. Pyrrolysine: a problem for bioinformatics
      ● Proteomics used to identify expressed proteins
      ● One is trimethylamine methyltransferase (TMA-MT)
      ● It shows post-translational modification
      ● It also maps to 2 ORFs in the genome sequence
      ● The ORFs are actually one gene with a read-through UAG codon
      ● Pyrrolysine is incorporated at the UAG
      ● This is the 22nd genetically-encoded amino acid
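Slide 18 explains why one gene can appear as two ORFs: a standard translation stops at the in-frame UAG, whereas the cell reads through it and inserts pyrrolysine. A toy Python sketch of that difference follows. The function `translate`, its `readthrough_tag` flag and the example sequence are hypothetical, and treating every in-frame TAG as pyrrolysine (one-letter code O) is a deliberate oversimplification.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna: str, readthrough_tag: bool = False) -> str:
    """Translate a DNA coding sequence from its first base.

    With readthrough_tag=False, translation stops at the first stop codon.
    With readthrough_tag=True, TAG (UAG in the mRNA) is translated as
    pyrrolysine ("O") instead of terminating; TAA and TGA still stop.
    Unrecognised codons are translated as "X".
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        if codon == "TAG" and readthrough_tag:
            protein.append("O")
            continue
        aa = CODON_TABLE.get(codon, "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Made-up coding sequence with an in-frame TAG: ATG GCT AAA TAG GGT GAA TGG TAA
toy_gene = "ATGGCTAAATAGGGTGAATGGTAA"
truncated = translate(toy_gene)                         # stops at TAG
full_length = translate(toy_gene, readthrough_tag=True)  # reads through TAG
```

An ORF finder using only the standard code would report the truncated product and then a second ORF downstream, which is the annotation problem the slide describes.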
  19. Statistical analysis of protein properties
      Archaea: 27 organisms, 62 338 ORFs
      Bacteria: 52 organisms, 165 192 ORFs
      Amino acid frequency (bioperl)
      Data matrix: organisms (rows) x composition (columns)
      PCA: principal components (R stats package)
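The data matrix on slide 19 has one row per organism and one column per amino acid. The talk computed the frequencies with BioPerl; the Python sketch below shows one plausible way to build such a matrix. The names `composition` and `composition_matrix`, the alphabetical column order and the toy proteomes are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# The 20 standard amino acids, used as the column order of the matrix.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(proteins: List[str]) -> List[float]:
    """Pooled amino acid frequencies over all proteins of one organism.

    Residues outside the 20 standard amino acids are ignored.
    Returns 20 fractions that sum to 1 (or all zeros for empty input).
    """
    counts: Counter = Counter()
    for protein in proteins:
        counts.update(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def composition_matrix(
    proteomes: Dict[str, List[str]]
) -> Tuple[List[str], List[List[float]]]:
    """Return (organism names, matrix) with organisms as rows."""
    names = sorted(proteomes)
    return names, [composition(proteomes[name]) for name in names]


# Toy proteomes (made-up sequences, not real data).
toy = {"org_a": ["AAAA", "AC"], "org_b": ["WW"]}
names, matrix = composition_matrix(toy)
```

Each row of `matrix` can then be passed to a PCA routine, with the organism names kept alongside for labelling the scores.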
  20. Principal components analysis of composition
      ● 2 components explain most of the variation in amino acid composition
      ● PC1 correlates with genome GC content
      ● PC2 correlates with optimum growth temperature
      ● The psychrophilic archaea are distinguished by PC2 score
      ● Their proteins contain:
        - more Gln, Ser, Thr, His, Asp
        - less Leu, Trp and Glu
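The analysis on slide 20 was run with the R stats package. To make the idea concrete without R, here is a small pure-Python sketch that extracts the leading principal components of a data matrix by power iteration with deflation. This is a swapped-in technique chosen for brevity, not the method used in the talk, and every function name and the toy data are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def center_columns(matrix: Matrix) -> Matrix:
    """Subtract each column's mean from its entries."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in matrix]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix of centered data (needs at least 2 rows)."""
    n, m = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
            for i in range(m)]


def power_iteration(cov: Matrix, iterations: int = 500) -> Tuple[List[float], float]:
    """Approximate the dominant eigenvector and eigenvalue of cov.

    The start vector is deliberately non-uniform: for composition data the
    all-ones vector lies in the null space of the covariance matrix.
    """
    m = len(cov)
    vec = [float(i + 1) for i in range(m)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance left in this direction
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m))
    return vec, value


def principal_components(matrix: Matrix, k: int = 2):
    """Return ([(component, explained fraction)], per-row scores) for k components."""
    centered = center_columns(matrix)
    cov = covariance(centered)
    m = len(cov)
    total = sum(cov[i][i] for i in range(m))
    components = []
    for _ in range(k):
        vec, value = power_iteration(cov)
        components.append((vec, value / total if total else 0.0))
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(m)]
               for i in range(m)]
    scores = [[sum(row[j] * vec[j] for j in range(m)) for vec, _ in components]
              for row in centered]
    return components, scores


# Toy data: four rows, two columns, most variance along the first column.
toy = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
components, scores = principal_components(toy, k=2)
```

With real data the rows would be organisms and the columns amino acid frequencies, and the PC1 and PC2 scores could then be compared with genome GC content and optimum growth temperature as the slide describes. Power iteration converges slowly when eigenvalues are close, so an established library routine is preferable for real analyses.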
  21. Conclusions
      ● Computational biology and bioinformatics are essential to modern biology
      ● Many tools are available to annotate proteins, both web-based and standalone
      ● Without experiments, bioinformatics is just predictions
      ● Data integration is our biggest problem