Slideshow transcript
Slide 1: Protein function and bioinformatics
Outline of talk
● Why do we need bioinformatics?
● What tools do we need?
● Case study: The Methanococcoides burtonii genome
Neil Saunders 76-455 n.saunders@uq.edu.au www.uq.edu.au/~uqnsaun1/
Slide 2: Why do we need bioinformatics?
● Rapid increase in data due to genomics
● Too much data to characterise genes/proteins individually
● Bioinformatics = “smart use” of information
● Ideally, computational and experimental biology are partners
Slide 3: The ideal computational – wet lab cycle
[Diagram labels: Biological system; Biological objects; Experiments; Computational objects; Biological inferences; Analyses]
Bioinformatics is about helping biologists solve problems
Slide 4: Introduction to genomics
● Genomes Online database www.genomesonline.org
Published/complete: 413
Bacteria in progress: 977
Eukarya in progress: 629
Archaea in progress: 57
Metagenomes: 56
10-50% of genes in a new genome may have no known function
Slide 5: Computational skills for genomics
"So what new skills will postdocs need to ensure that they don't become science relics? The answer is math, statistics, and knowledge of a scripting language for computers."
The Scientist, "Bioinformatics Knowledge Vital to Careers" Volume 16 | Issue 17 | 53 | Sep. 2, 2002 www.thescientist.com
Slide 6: Using WWW resources
● The best web resources provide:
- useful tools for analysis
- integrated data from many sources
Good examples
● InterPro database http://www.ebi.ac.uk/interpro/
● Expasy http://au.expasy.org
● UniProt http://www.uniprot.org/
● CBS Prediction servers http://www.cbs.dtu.dk/services/
● IMG Database http://img.jgi.doe.gov/
But...
● Web services are no good for genome-scale analyses
● There are usually limits on data input (with good reason)
Nucleic Acids Research publishes annual database and web server editions: http://nar.oxfordjournals.org/
Slide 7: Computational infrastructure for genomics
[Diagram labels. Biological objects / Computational objects: Genome; Gene sequence; Protein sequence; Protein structure; Pathway. Analysis (limitless): Sequence analysis; Assembly; Regulatory motifs; Structural modeling; Phylogeny; Comparative genomics; Pathway reconstruction]
Key points
● Appropriate hardware: workstation v. cluster
● Linux Linux Linux!
● Freely-available, open source software is all you need
● Toolkits and libraries (e.g. BioPerl) to build your own solutions
● Philosophy of “many small tools plus glue” - scripting language
● Website + database skills - sharing
Slide 8: BioPerl: a life sciences computational toolkit
● Website: http://www.bioperl.org
● A collection of Perl modules for biology
● Handles many common tasks in sequence/structure analysis, e.g.
- read/write various sequence formats
- run BLAST and parse the output
- read/write/analyse sequence alignments
- access local or remote databases
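To make the first task on this slide concrete, here is a minimal sketch of reading and writing FASTA records. It is written in plain Python rather than BioPerl, so it illustrates the task and not the BioPerl API. The function names `read_fasta` and `format_fasta` and the two toy records are hypothetical, invented for this example.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def format_fasta(header, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"

# Hypothetical two-record input, invented for illustration only.
example = StringIO(">seq1 toy protein\nMKV\nLAA\n>seq2\nMST\n")
records = list(read_fasta(example))
for header, sequence in records:
    print(format_fasta(header, sequence, width=4), end="")
```

A toolkit such as BioPerl handles many more formats and edge cases than this; the sketch only shows the kind of "small tool" that scripting makes easy.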
Slide 9: Annotation (or not) using BLAST
BLAST: Basic Local Alignment Search Tool
● Is useful for finding similar sequences quickly
● Not sensitive – less useful for weakly-similar sequences
● Not much good at all for annotation
Why not?
● “Hypothetical”: the database sequence is unique
● “Conserved hypothetical”: several hits but no known function
● Multi-domain proteins
● BLAST database contains incorrect annotations
● Annotation is at the whim of whoever deposited the sequence
Classic example: IMPDH Wu et al. (2003) Comp. Biol. Chem. 27: 37-47
Slide 10: A better annotation tool: InterProScan
● IPRScan is a tool to search the InterPro database
● It uses sequence signature profiles – more sensitive than BLAST
● Integrates the search results from multiple databases
● A good first step to characterise a new sequence
● Available as standalone package and runs on clusters
Slide 11: Structure prediction: threading and modelling
● The structure of a protein often explains how it functions
● However, structural determination is laborious, difficult and time-consuming
● Modelling can be useful in cases where the sequence is similar to a known structure
Threading: Fit query sequence to fold database
Homology modelling: Assume similar sequence = similar structure
Slide 12: Some modelling tools and databases
● SwissModel: http://swissmodel.expasy.org/
● MODELLER: http://www.salilab.org/modeller/
● PROSPECT: http://compbio.ornl.gov/structure/prospect2/
● ModBase: http://modbase.compbio.ucsf.edu/
Slide 13: Introduction to M. burtonii
[Figure labels: M. burtonii; Ace Lake, Vestfold Hills; The Archaea]
Methanococcoides burtonii
● Isolated from Ace Lake, Antarctica (1-2 °C)
● Grows optimally at 23 °C
● Is an archaeon
● Is a psychrophilic methanogen
Slide 14: The M. burtonii genome
What features of this genome are related to cold adaptation?
Slide 15: Discovery of CSP-like proteins in M. burtonii
● CSP = cold shock protein
● Expressed in bacteria at low temperature
● Functions as RNA chaperone to facilitate transcription at low temperature
● Present in some Archaea, including M. frigidum, but not M. burtonii
Slide 16: Discovery of CSP-like proteins in M. burtonii
[Workflow diagram labels: Protein sequences; PROSPECT thread v. CSD folds; MODELLER; structural model. Structure labels: d1sro__; M. burtonii YP_564958]
● Both proteins are expressed (proteomics)
● Located in a putative exosome/proteasome superoperon
● This is consistent with their proposed function
Slide 17: Integrating information: structural RNA study
[Plot: % GC (stems, all bases) against OGT (°C)]
Is tRNA GC content related to OGT?
● tRNAScan: find tRNA in genomes
● GC content calculated using Perl scripts
Dihydrouridine in M. burtonii
● tRNA contains > 1 hU/tRNA
● Maintains flexibility at low temperature
● DUS gene identified using iprscan
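The GC calculation on this slide can be sketched as follows. The talk used Perl scripts that are not shown, so this Python version is an illustration and not the author's code. It assumes each tRNA comes with a dot-bracket secondary-structure string in which "(" and ")" mark paired (stem) positions. That input format, the helper names and the toy sequence are all assumptions made for the example.

```python
def gc_percent(seq):
    """Percentage of G and C among the A/C/G/T/U bases in seq (other symbols ignored)."""
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        return 0.0
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)

def stem_bases(seq, structure):
    """Return only the bases at paired positions of a dot-bracket structure string."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return "".join(b for b, s in zip(seq, structure) if s in "()")

# Toy hairpin, invented for illustration (not a real tRNA).
toy_seq = "GGCAAAAGCC"
toy_structure = "(((....)))"

all_gc = gc_percent(toy_seq)
stem_gc = gc_percent(stem_bases(toy_seq, toy_structure))
print(f"all bases: {all_gc:.1f}% GC, stems: {stem_gc:.1f}% GC")
```

Computing both values per organism and plotting them against OGT gives the two series named on the slide (stems and all bases).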
Slide 18: Pyrrolysine: a problem for bioinformatics
● Proteomics used to identify expressed proteins
● One is trimethylamine methyltransferase (TMA-MT)
● It shows post-translational modification
● It also maps to 2 ORFs in the genome sequence
● The ORFs are actually one gene with a read-through UAG codon
● Pyrrolysine is incorporated at the UAG
● This is the 22nd genetically-encoded amino acid
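A small sketch shows why a read-through UAG splits one gene into two ORFs for standard tools. It translates a DNA sequence with the standard genetic code, optionally reading TAG (UAG in the mRNA) as pyrrolysine, written with its one-letter code "O". The `translate` function and the toy sequence are hypothetical. In the talk the gene was found through proteomics, not by a script like this, and not every UAG in such a genome encodes pyrrolysine.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, read_through_tag=False):
    """Translate dna from its first base until the first stop codon.

    With read_through_tag=True, TAG is decoded as pyrrolysine ("O")
    instead of terminating translation.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "TAG" and read_through_tag:
            protein.append("O")
            continue
        amino = CODON_TABLE.get(codon, "X")  # X marks an ambiguous codon
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)

# Toy sequence, invented for illustration: ATG AAA TAG GGT TAA
toy_gene = "ATGAAATAGGGTTAA"
standard = translate(toy_gene)
with_pyl = translate(toy_gene, read_through_tag=True)
print(standard, with_pyl)
```

With standard decoding the protein ends at the in-frame TAG, so a gene finder reports a truncated ORF and may call a second ORF downstream; with read-through the two pieces form a single product.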
Slide 19: Statistical analysis of protein properties
Archaea: 27 organisms, 62 338 ORFs
Bacteria: 52 organisms, 165 192 ORFs
Amino acid frequency (bioperl)
Data matrix: organisms (rows) x composition (columns)
PCA: principal components (R stats package)
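The data matrix step can be sketched as below. The talk computed amino acid frequencies with BioPerl; this plain-Python version is an assumption-laden illustration, not that code. It pools all proteins of an organism and reports the fraction of each of the 20 standard amino acids. Whether the original pooled residues or averaged per-ORF frequencies is not stated. The organism names and sequences are invented.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed column order

def composition(proteins):
    """Fraction of each standard amino acid, pooled over all sequences given."""
    counts = Counter()
    for seq in proteins:
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

def build_matrix(proteomes):
    """Return (organism names, matrix) with one composition row per organism."""
    names = sorted(proteomes)
    return names, [composition(proteomes[name]) for name in names]

# Hypothetical toy proteomes, invented for illustration only.
toy_proteomes = {
    "organism_B": ["AAAA"],
    "organism_A": ["AC", "CA"],
}
names, matrix = build_matrix(toy_proteomes)
for name, row in zip(names, matrix):
    print(name, [round(x, 2) for x in row])
```

Each row sums to 1, so the 20 columns are not independent; that is worth remembering when the matrix is passed to PCA.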
Slide 20: Principal components analysis of composition
● 2 components explain most of the variation in amino acid composition
● PC1 correlates with genome GC content
● PC2 correlates with optimum growth temperature
● The psychrophilic archaea are distinguished by PC2 score
● Their proteins contain:
- more Gln, Ser, Thr, His, Asp
- less Leu, Trp and Glu
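For readers unfamiliar with PCA, the following is a minimal pure-Python sketch of how leading components, their scores and their share of the variance can be obtained from an organisms x composition matrix. The talk used the R stats package; this sketch swaps in power iteration with deflation on the covariance matrix, which is one simple way to get the top components and is not claimed to be what R does internally. Columns are centred but not scaled, which is also an assumption. All function names and data are hypothetical.

```python
import math

def center(matrix):
    """Subtract each column mean so every column has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def covariance(centered):
    """Sample covariance matrix (divides by n - 1) of already-centred rows."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def top_eigenpair(cov, iterations=1000, tol=1e-12):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    # Start from the row with the largest variance; it lies in the matrix range.
    start = max(range(len(cov)), key=lambda j: cov[j][j])
    norm = math.sqrt(sum(v * v for v in cov[start]))
    if norm == 0:
        raise ValueError("matrix has no variance")
    vec = [v / norm for v in cov[start]]
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        nxt = [x / norm for x in nxt]
        converged = sum((a - b) ** 2 for a, b in zip(nxt, vec)) < tol
        vec, eigenvalue = nxt, norm
        if converged:
            break
    return eigenvalue, vec

def principal_components(matrix, k=2):
    """Return up to k tuples (explained fraction, loadings, scores)."""
    centered = center(matrix)
    cov = covariance(centered)
    size = len(cov)
    total = sum(cov[i][i] for i in range(size))
    components = []
    for _ in range(k):
        residual = sum(cov[i][i] for i in range(size))
        if total == 0 or residual <= 1e-12 * total:
            break  # no variance left to explain
        eigenvalue, vec = top_eigenpair(cov)
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        components.append((eigenvalue / total, vec, scores))
        # Deflation: remove the component just found before looking for the next.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(size)]
               for i in range(size)]
    return components

# Toy 4 x 2 data set, invented for illustration only.
toy = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
for fraction, loadings, scores in principal_components(toy, k=2):
    print(round(fraction, 3), [round(x, 3) for x in loadings])
```

In the analysis described on the slide, the scores of each organism on the second component would then be compared with optimum growth temperature, and the loadings would indicate which amino acids drive the separation.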
Slide 21: Conclusions
● Computational biology and bioinformatics are essential to modern biology
● Many tools are available to annotate proteins:
- web-based
- standalone
● Without experiments, bioinformatics is just predictions
● Data integration is our biggest problem
www.uq.edu.au/~uqnsaun1/


