Three groups annotated the genome of Mycoplasma genitalium and found inconsistencies in their annotations. Of the 468 genes, 318 were annotated consistently by all three groups but 45 had conflicting annotations. Errors likely arose from insufficient sequence similarity to determine homology accurately or incorrectly inferring function based on homology alone. Database curation is needed to prevent propagation of erroneous annotations.
Comparative genomics involves systematically comparing genome sequences from different organisms. It uses computer programs to identify homologous genomic regions and align sequences at the base-pair level. Comparing genomes at different phylogenetic distances can provide insights into gene structure/function, evolution, and characteristics unique to each organism. Key tools for comparative genomics include genome browsers, aligners, and databases that classify orthologous gene clusters conserved across species.
This document provides a summary of a seminar on comparative genomics techniques. It discusses three levels of genome research: structural genomics, functional genomics, and comparative genomics. Comparative genomics involves analyzing and comparing different genomes to study gene content, function, organization, and evolution. Techniques discussed include genome sequencing, mapping, and bioinformatics tools. The document also outlines what can be compared between genomes and how comparative genomics has provided insights into evolution and gene function.
This document discusses various bioinformatics tools and methods for identifying genes from genomic sequences. It begins by defining genes and genomes, then describes reference databases like RefSeq that are important for gene identification. It outlines the general workflow for gene identification, including obtaining sequences, preprocessing, annotation, prediction, and validation. Specific tools mentioned include GENSCAN, Glimmer, and Augustus for gene prediction, and BLAST for sequence alignment. The document also discusses identifying other genomic features like promoters, repeats, and open reading frames. It emphasizes that accurate gene identification requires both computational and experimental approaches.
Protein threading is a protein structure prediction method that involves "threading" or placing an amino acid sequence into known protein structure templates to find the best matching fold. The key steps are:
1) A query sequence is threaded into structural positions of templates from a structure library to find sequence-structure alignments
2) Alignments are scored and optimized using an objective function accounting for residue interactions and preferences
3) The highest scoring template is selected as the predicted structure, though loop regions are often not accurately predicted
Introduction to sequence alignment part II by SumatiHajela
This document provides an introduction to sequence alignment and discusses gaps and gap penalties. It defines matches and gaps in a sequence alignment and shows how substitutions, deletions, and insertions are represented. It describes different types of gap penalties, including constant, linear, affine, convex, and profile-based variable penalties. Highlights include that gaps allow an alignment to be extended but introduce uncertainty, so penalties are applied. Examples demonstrate assigning regular and affine gap penalties.
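As a rough illustration of how linear and affine gap penalties differ, here is a minimal sketch. The penalty values and the function names are assumptions chosen for the example, not taken from the presentation, and the affine form used here (opening cost for the first gap position, extension cost for each further position) is only one of several conventions in use.

```python
def gap_run_lengths(aligned):
    """Return the length of each consecutive run of '-' in an aligned sequence."""
    runs, current = [], 0
    for ch in aligned:
        if ch == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def linear_gap_penalty(length, per_position=2):
    """Linear model: every gap position costs the same (value is illustrative)."""
    return per_position * length


def affine_gap_penalty(length, gap_open=5, gap_extend=1):
    """Affine model (assumed convention): gap_open for the first position,
    gap_extend for each additional position in the same run."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned, penalty):
    """Sum a per-run penalty function over all gap runs in one aligned sequence."""
    return sum(penalty(n) for n in gap_run_lengths(aligned))


if __name__ == "__main__":
    seq = "AC--G-T"  # one gap of length 2, one gap of length 1
    print(total_gap_penalty(seq, linear_gap_penalty))
    print(total_gap_penalty(seq, affine_gap_penalty))
```

Under the affine model one long gap is cheaper than several short gaps of the same total length, which is the usual reason for preferring it when insertions and deletions are expected to span several residues at once.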
This document discusses genomic databases. It begins by defining key terms like genes, genomes, and genomics. It then describes categories of biological databases including those for nucleic acid sequences, proteins, structures, and genomes. It provides many examples of genomic databases for both non-vertebrate and vertebrate species, including databases for bacteria, fungi, plants, invertebrates, and humans. The final sections note that genomic databases collect genome-wide data from various sources and that databases can be specific to a single organism or category of organisms.
Structural genomics is a field that aims to determine the 3D structures of all proteins encoded by a genome. It involves determining structures on a large scale using techniques like X-ray crystallography and NMR. This allows identification of novel protein folds and potential drug targets. Comparative genomics compares genomic features between organisms and provides insights into evolution and conserved sequences and functions. It is a key tool in fields like medicine and agriculture.
The document discusses Prosite, a database of protein family signatures that can be used to determine the function of uncharacterized proteins. It contains patterns and profiles formulated to identify which known protein family a new sequence belongs to. The Prosite database consists of two files - a data file containing information for scanning sequences, and a documentation file describing each pattern and profile. New Prosite entries are mainly profiles developed by collaborators at the SIB Swiss Institute of Bioinformatics to identify distantly related proteins based on conserved residues.
Structural genomics aims to determine the 3D structure of all proteins in a genome. It uses high-throughput methods like X-ray crystallography and NMR on a genomic scale. This allows determination of protein structures for entire proteomes. It provides insights into protein function and can aid drug discovery by identifying potential drug targets like in Mycobacterium tuberculosis. Structural genomics leverages completed genome sequences to clone and express all encoded proteins for structural characterization.
An introduction to promoter prediction and analysis by Sarbesh D. Dangol
This document provides an introduction to promoter prediction and analysis in plants. It discusses what promoters are, including their cis-acting elements and core promoter regions. It describes different types of promoters such as constitutive, spatiotemporal, and inducible promoters. It also discusses models for finding binding sites in promoters and experimental approaches for identifying regulatory elements like chromatin immunoprecipitation. Finally, it mentions some bioinformatics tools and databases that can be used for promoter analysis.
This document provides an overview and introduction to RNA-seq analysis using Next Generation Sequencing. It discusses the RNA-seq workflow including mapping reads with TopHat2, transcript assembly with Cufflinks, and differential expression analysis. Key points covered include the advantages of RNA-seq over microarrays, the exponential drop in sequencing costs, mapping strategies for junction reads including TopHat, and running TopHat from the command line.
ESTs are short sequences of DNA that represent genes expressed in certain tissues or organisms. They provide a quick and inexpensive way for scientists to discover new genes and map their positions in genomes. ESTs represent a snapshot of genes expressed in a tissue at a given time. Sequencing the beginning or end of cDNA clones produces 5' and 3' ESTs, which can help identify genes and study gene expression and regulation.
The document discusses BLAST (Basic Local Alignment Search Tool), an algorithm used to compare a query DNA or protein sequence against a database of sequences. BLAST works by identifying exact or approximate matches between words of 3-11 letters in the query and database sequences. Matches are extended to find local alignments with high scores. Significant alignments are identified based on their score and the expected number of matches by chance (E-value). The document provides examples of how BLAST finds local alignments and calculates E-values. It also describes different BLAST programs and suggestions for using BLAST.
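The word-matching step and the E-value idea can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it finds only exact word matches (no neighborhood words, no extension step), and the function names and the default values of `K` and `lam` are placeholders assumed for the example. The formula E = K·m·n·e^(−λS) is the standard Karlin-Altschul expression for the expected number of chance alignments scoring at least S.

```python
import math


def words(seq, w):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def seed_hits(query, subject, w=3):
    """Exact word matches between query and subject as (query_pos, subject_pos).

    Real BLAST also accepts approximate (neighborhood) words for proteins
    and then extends each seed; both steps are omitted here.
    """
    index = {}
    for j, word in words(subject, w):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in words(query, w) for j in index.get(word, [])]


def expected_hits(score, m, n, K=0.1, lam=0.3):
    """E-value: expected number of chance alignments with at least this score.

    m and n are the query and database lengths; K and lam depend on the
    scoring system, and the defaults here are illustrative placeholders.
    """
    return K * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    print(seed_hits("ACGT", "TACGA", w=3))
    print(expected_hits(score=40, m=100, n=1_000_000))
    print(expected_hits(score=50, m=100, n=1_000_000))
```

The E-value falls exponentially with the alignment score and grows linearly with the size of the search space, so the same score is less significant against a larger database.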
Microarray technology allows researchers to analyze gene expression levels on a genomic scale. DNA microarrays contain many genes arranged on a slide that can be used to detect differences in gene expression between samples. The microarray workflow involves sample preparation, hybridization of labeled cDNA to the array, image scanning, data normalization and statistical analysis to identify differentially expressed genes between conditions. Multiple testing is a challenge and statistical methods must account for false positives and negatives.
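To make the multiple-testing point concrete, here is a minimal sketch of the Benjamini-Hochberg false discovery rate adjustment. It is one widely used correction, chosen here as an assumption; the summary above does not say which method the presentation uses, and the p-values in the usage example are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-gene p-values from a differential expression test.
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p={p:.3f}  adjusted={q:.3f}")
```

Genes whose adjusted value falls below a chosen threshold (0.05 is common) are reported as differentially expressed, which controls the expected proportion of false positives among the reported genes rather than the chance of any single false positive.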
This document summarizes different computational methods for protein structure prediction, including homology modeling, fold recognition, threading, and ab initio modeling. Homology modeling relies on identifying proteins with similar sequences and known structures. Fold recognition and threading can be used when there are no homologs, to identify proteins with the same overall fold but different sequences. Ab initio modeling uses physics-based modeling and protein fragments to predict structure from sequence alone, and has challenges due to the vast number of possible conformations.
This document discusses different methods for genome sequencing and assembly, including restriction enzyme fingerprinting, marker sequences, and hybridization assays. It focuses on using marker sequences like sequence-tagged sites (STS), expressed sequence tags (ESTs), untranslated regions (UTRs), and single nucleotide polymorphisms (SNPs) to map genomes. Large-insert cloning vectors like BACs and PACs can be used with restriction enzyme fingerprinting and FPC software to assemble contigs and map genomes at a large scale. Marker sequences provide a dense set of physical markers to build accurate physical maps of genomes.
The CATH database hierarchically classifies protein domains obtained from protein structures deposited in the Protein Data Bank. Domain identification and classification uses both manual and automated procedures. CATH includes domains from structures determined at 4 angstrom resolution or better that are at least 40 residues long with 70% or more residues having defined side chains. Submitted protein chains are divided into domains, which are then classified in CATH.
Introduction
Transcriptome analysis
Goal of functional genomics
Why we need functional genomics
Technique
1. At DNA level
2. At RNA level
3. At protein level
4. Loss of function
5. Functional genomics and bioinformatics
Application
Latest research and reviews
Websites of functional genomics
Conclusions
Reference
Comparative genomics involves comparing genomes to discover similarities and differences. It can provide insights into evolutionary relationships, help predict gene function, and aid in drug discovery. The first step is often aligning genome sequences using tools like BLAST or MUMmer. Genomes can then be compared at various levels, such as overall nucleotide statistics, genome structure, and coding/non-coding regions. Comparing gene and protein content across genomes helps predict functions. Conserved genomic features across species also aid prediction. Insights into genome evolution come from studying molecular events like inversions and duplications. Comparative genomics has impacted phylogenetics and drug target identification.
Comparative genomics in eukaryotes, organelles by KAUSHAL SAHU
Comparative genomics involves comparing the genomic features of different organisms, such as DNA sequences, genes, and gene order. This field has revealed both similarities and differences between organisms that can provide insights into evolutionary relationships. Some of the first comparative genomic studies compared large DNA viruses. Since then, many complete genome sequences have been determined, including for yeast, fruit flies, worms, plants, mice, and humans. While humans have around 35,000 genes, complexity is not solely due to gene number. Comparative analysis of human and mouse genomes shows 40% sequence similarity and similar gene numbers, but different genome sizes. Mitochondrial genomes also yield insights when compared between domains of life. Computational tools like BLAST are used to facilitate genomic
Genomic DNA libraries contain representative copies of all DNA fragments in an organism's genome, including both expressed and non-expressed sequences. They are constructed by isolating genomic DNA, fragmenting it, and cloning the fragments into suitable vectors like lambda phage or BACs. cDNA libraries contain only expressed sequences, as they are constructed by isolating mRNA from tissues, reverse transcribing it to cDNA, and cloning the cDNA fragments. Both library types are useful for gene discovery, sequencing, mapping genomes, and studying regulatory sequences.
This document provides an overview of functional genomics and methods for transcriptome analysis. It discusses two main approaches - sequence-based approaches like expressed sequence tags (ESTs) and serial analysis of gene expression (SAGE), and microarray-based approaches. For sequence-based approaches, it describes how ESTs can provide gene discovery and expression information but have limitations. It outlines the SAGE methodology and gene index construction to organize EST data. For microarrays, it summarizes the basic workflow including sample preparation, hybridization, image analysis and data normalization to identify differentially expressed genes through statistical tests.
DNA SEQUENCING METHODS AND STRATEGIES FOR GENOME SEQUENCING by Puneet Kulyana
This presentation will give you a brief idea about the various DNA sequencing methods and various strategies used for genome sequencing and much more vital information related to gene expression and analysis
Functional genomics uses genome-wide experimental approaches to assess gene function on a large scale. It analyzes gene expression through techniques like transcriptomics and proteomics. Transcriptomics analyzes gene expression profiles through RNA sequencing or microarray analysis. Microarray analysis involves hybridizing fluorescently-labeled cDNA or cRNA to microarrays containing DNA probes to measure gene expression levels across thousands of genes simultaneously. Functional genomics provides a global understanding of gene function and molecular interactions through integrated omics approaches.
Gene prediction is the process of determining where a coding gene might be in a genomic sequence. A protein-coding region begins with a start codon (where translation begins) and ends with a stop codon (where translation ends).
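The simplest form of this idea is an open reading frame (ORF) scan: look for a start codon and read in frame until a stop codon. The sketch below is a deliberately naive illustration, assuming the standard start codon ATG and stop codons TAA, TAG, and TGA, and scanning only the forward strand. The function name and the `min_codons` threshold are assumptions for the example; real gene finders also handle the reverse strand, introns, and statistical signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs.

    start is the index of the A in ATG; end is one past the stop codon,
    so dna[start:end] is the full ORF including start and stop codons.
    min_codons counts the codons before the stop, including the start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATAGCC"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```

A scan like this is fast but produces false positives on long sequences and misses short genes, which is why it is usually only a first pass before similarity searches or model-based prediction.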
A brief introduction to two techniques used to study protein interactions: the yeast two-hybrid (Y2H) system and chromatin immunoprecipitation (ChIP).
I hope it helps and please comment if I've made any mistakes.
The document summarizes key aspects of amino acids and protein structure in 3 paragraphs or less:
Amino acids are the building blocks of proteins. They share common structural features and exist in L- and D-forms. In proteins, amino acids are almost exclusively in the L-configuration. Amino acids are classified by the properties of their side chains into nonpolar, aromatic, polar, positively charged, and negatively charged categories.
Protein structure is hierarchical, progressing from primary to secondary, tertiary, and quaternary levels. The primary structure is the amino acid sequence. Secondary structures include alpha helices, beta sheets, and turns formed by hydrogen bonding. Tertiary structure refers to the overall 3
A talk that I gave to a general audience at UC Davis. Slides were also used for Prof. Ian Korf's presentation at the Genome 10K workshop (May 25th, 2013). This talk mostly concerns the results of the Assemblathon 2 contest, but also covers other issues relating to genome assembly.
Note, this talk has been superseded by updated versions (also available on slideshare)!
This is the second presentation of the BITS training on 'Mass spec data processing'.
It reviews the methods for separating protein mixtures prior to further analysis.
Thanks to the Compomics Lab of the VIB for contribution.
Apollo is a web-based application that supports and enables collaborative genome curation in real time, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Apollo allows researchers to break down large amounts of data into manageable portions to mobilize groups of researchers with shared interests.
The i5K, an initiative to sequence the genomes of 5,000 insect and related arthropod species, is a broad and inclusive effort that seeks to involve scientists from around the world in their genome curation process, and Apollo is serving as the platform to empower this community.
This presentation is an introduction to Apollo for the members of the i5K Pilot Project working on species of the order Hemiptera.
B.sc biochem i bobi u 3.3 homologous and heterologous by Rai University
This document defines and compares heterologs, homologs, analogs, orthologs, and paralogs. Heterologs differ in origin and activity, while homologs have a common origin but not necessarily common activity. Sequence similarity is a quantitative measure of how many bases match between two aligned sequences. Analogs have common activity but different origins, evolving convergently. Orthologs are homologs that evolved from a common ancestral gene through speciation, often retaining the same function. Paralogs are homologs produced through gene duplication within a genome, and may evolve new functions.
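The notion of sequence similarity as a count of matching positions can be sketched as a percent-identity calculation over two already-aligned sequences. The function name is an assumption for this example, and so is the choice of denominator: here columns containing a gap in either sequence are excluded, but other conventions (for instance dividing by the full alignment length) are also in use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of ungapped alignment columns where the two residues match.

    Both inputs must be rows of the same alignment, so they have equal
    length and use '-' for gaps. Columns with a gap in either row are
    skipped (an assumed convention; others exist).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(percent_identity("AC-T", "ACGT"))
```

High identity suggests, but does not prove, homology: analogs can resemble each other through convergence, and distant homologs can have low identity, so a number like this is a starting point for inferring orthology or paralogy rather than a conclusion.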
Making Protein Function and Subcellular Localization Predictions: Challenges ... (fionabrinkman)
The document discusses challenges and opportunities in predicting protein function and subcellular localization from sequence data alone. It outlines issues with current orthology-based and pathway-based prediction methods, and ways to improve functional predictions by differentiating true orthologs from non-orthologous relationships and developing better pathway signatures. The author advocates for databases like OrtholugeDB that pre-compute ortholog predictions across many genomes to facilitate large-scale evaluation of prediction methods.
Genome assembly: the art of trying to make one big thing from millions of ver... (Keith Bradnam)
A talk about genome assembly. Largely aimed at people new to the field, this slide deck is an updated version of a talk that I first gave last year and which I recently presented as part of a UC Davis Bioinformatics Core training workshop.
Author: Keith Bradnam, Genome Center, UC Davis
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This document discusses various methods for annotating genomes after sequencing and assembly. Sequence analysis approaches like identifying open reading frames can rapidly and inexpensively find some genes, but have weaknesses like false positives and missing short genes. More accurate methods are needed to find non-coding RNAs, pseudogenes, and other elements. As sequencing technologies generate more data, the bottleneck has shifted to analysis, requiring skills in both biology and mathematics. The document provides an example sequence to annotate and poses questions about fast, cheap and accurate annotation methods.
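A minimal sketch of open reading frame scanning shows why the approach is fast and cheap, and also where its weaknesses come from. It assumes a plain DNA string over A/C/G/T, ATG as the only start codon and the standard stop codons; `find_orfs` and `min_codons` are hypothetical names chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns (strand, start, end) tuples; coordinates are 0-based, end is
    exclusive and includes the stop codon. For the "-" strand the
    coordinates refer to the reverse-complemented sequence.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None  # close the ORF and keep scanning
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

The length threshold is the trade-off: lowering `min_codons` recovers short genes but admits more chance ORFs (false positives), while raising it does the opposite. A scan like this also ignores introns and non-coding RNAs entirely.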
The document provides an overview of plant genome sequence assembly, including:
1) A brief history of sequencing technologies and their improvements over time, from Sanger sequencing to newer technologies producing longer reads.
2) Key steps in a sequencing project including read processing, filtering, and corrections before assembly into contigs and scaffolds using appropriate software.
3) Factors to consider for experimental design and assembly optimization such as sequencing depth, library types, and software choices depending on the genome and data characteristics.
Computational Approaches to Systems Biology (Mike Hucka)
Presentation given at the Sydney Computational Biologists meetup on 21 August 2013 (http://australianbioinformatics.net/past-events/2013/8/21/computational-approaches-to-systems-biology.html).
Homologous genes are genes that have descended from a common ancestral gene. There are two main types of homologous genes:
1. Orthologous genes are homologous genes in different species that arose due to speciation. For example, the human and mouse eyeless genes are orthologs that descended from the eyeless gene in their last common ancestor.
2. Paralogous genes are homologous genes within the same species that arose due to a gene duplication event. For example, the fruit fly eyeless and twin of eyeless genes are paralogs that descended from a duplication of the eyeless gene in a fruit fly ancestor.
Homologous genes can differ in their sequences due
The document discusses various methods for predicting protein function, including homology-based transfer of annotation and prediction of functional motifs and domains. Homology-based transfer can infer molecular function from sequence similarity, but biological process is only transferable between orthologs. Orthologs can be detected through phylogenetic trees or automated methods like InParanoid. Each protein domain contributes to molecular function, while short motifs like phosphorylation sites are also important. Functional annotation involves describing proteins at the molecular, biological process, and cellular component levels.
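One simple automated heuristic for proposing ortholog pairs is the reciprocal best hit: gene a in genome A and gene b in genome B are paired when each is the other's top-scoring match. The sketch below is only that heuristic, under the assumption that similarity scores (for example bit scores from an all-against-all search) are already available as dictionaries; it is not the InParanoid algorithm, and all names in it are illustrative.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
    b_vs_a = {("b1", "a1"): 195, ("b2", "a1"): 160, ("b2", "a2"): 85}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data, a2's best hit is b2, but b2's best hit is a1, so the pair is not reported. That is the kind of non-reciprocal relationship that can indicate paralogy and that argues against transferring biological-process annotation.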
Genetics is the study of genes, heredity, and variation in living organisms. It is a broad discipline that includes molecular genetics, transmission genetics, population genetics, and many other fields. Some key areas of genetics are molecular genetics, which studies genes at the molecular level; transmission genetics, which explores inheritance patterns; population genetics, which studies genetic variation in populations; and quantitative genetics, which examines continuously measured traits. Genetics interfaces with disciplines like biochemistry, molecular biology, and evolution and has applications in areas such as agriculture, medicine, and conservation.
Genome annotation, NGS sequence data, decoding sequence information, The genome contains all the biological information required to build and maintain any given living organism.
Proteomics and its applications
Proteomics involves the analysis of the entire complement of proteins in a cell, tissue or organism. It assesses protein activities, modifications, localization and interactions. Proteomics uses techniques like gel electrophoresis, mass spectrometry and liquid chromatography to separate and identify proteins. These techniques can be applied to discover disease biomarkers, develop diagnostic tools, and gain insights into disease pathogenesis and treatment. Proteomics has applications in studying various diseases including cancer, diabetes and infections. It provides insights into cellular processes and systems biology.
The document discusses the field of proteomics, which is the large-scale study of proteins, including their functions and structures. It defines proteomics and describes several areas within it, such as functional proteomics, expressional proteomics, and structural proteomics. It outlines typical proteomics experiments and some key methods used, including two-dimensional electrophoresis, mass spectrometry, and protein-protein interaction prediction methods like phylogenetic profiling.
1. Recombinant DNA technology uses restriction enzymes and DNA ligase to cut and join DNA from different sources, allowing genes to be transferred between organisms.
2. Polymerase chain reaction (PCR) amplifies specific DNA sequences, enabling rapid copying of genes. It is used in DNA fingerprinting for identification.
3. Transgenic organisms have foreign genes inserted, allowing production of useful proteins like insulin from bacteria and growth hormones from animals and plants.
This document summarizes C. Titus Brown's research on assembling large, complex metagenomes from Illumina sequencing data. Brown discusses using digital normalization and data partitioning techniques to preprocess and separate reads before assembly. He finds these approaches help smooth coverage and reduce errors, improving assemblies. While soil metagenomes remain challenging due to high strain variation, Brown's group has had success assembling other sample types. They are working to optimize and evaluate assembly methods to better reconstruct genomes from metagenomic data.
Proteomics is the study of the structure and function of proteins. It involves identifying and quantifying the proteins expressed by a genome or cell type. Key aspects of proteomics include protein separation techniques like gel electrophoresis, mass spectrometry to identify proteins, and analyzing protein interactions and post-translational modifications. While genomes provide the blueprint, proteomics helps understand the diversity of proteins expressed and how they function together to direct cellular activities. It is a promising tool for disease diagnosis by identifying protein biomarkers.
The document discusses using genomic context analysis and high-throughput data to construct and interpret networks of functional associations between genes and proteins. It describes the STRING database, which uses genomic context evidence from 110 species to predict functional links. It also discusses integrating various high-throughput data types, like protein-protein interaction data and gene expression data from microarrays, to improve the coverage and accuracy of predicted functional associations in STRING. Normalization methods and singular value decomposition are used to analyze and combine expression data from multiple experiments.
The document outlines topics that will be covered in a bioinformatics course, including biological databases, sequence alignments, database searching, phylogenetics, protein structure, gene prediction, gene ontologies, hidden Markov models, and non-coding RNA. It then provides more details on topics like using sequence similarity to gather information from unknown protein sequences, finding conserved patterns in alignments using profiles and hidden Markov models, and challenges around gene prediction from raw genomic sequences.
This document discusses genomic technologies that can be used to observe the human genome and their applications. It covers microarrays, next-generation sequencing, DNA methylation, copy number variation, and more. Challenges include the cost of these technologies and integrating the large amounts of data they produce to improve healthcare.
Here are some suggestions for open online bioinformatics lectures and courses from famous universities:
- MIT OpenCourseWare has free bioinformatics course materials and videos from MIT courses.
- edX has massive open online courses (MOOCs) in bioinformatics from universities like Harvard, Berkeley, MIT. Some are free to audit.
- Coursera has bioinformatics courses from top universities like Johns Hopkins, University of Toronto, Peking University.
- YouTube has full lecture videos from bioinformatics courses at universities like Stanford, UC San Diego, University of Cambridge.
- Khan Academy has introductory bioinformatics lectures on topics like sequence alignment, gene finding, protein structure.
- EMBL-
This document discusses high-resolution views of the cancer genome using various technologies including DNA microarrays, comparative genomic hybridization, tiling arrays, next-generation sequencing, and DNAse-Seq. It describes how these technologies can be used to analyze gene expression, copy number variation, chromatin structure, and more to better understand cancer at the genomic level. Integrating data from all these sources presents challenges but may help improve individual health outcomes.
This document summarizes the types and hierarchical levels of biological databases, including primary databases that contain raw sequence data, secondary databases that contain human-curated knowledge, and tertiary databases that integrate information from various sources. It provides examples of important databases like GenBank, UniProt, and RefSeq. It also describes how bioinformatics analyzes genetic information through sequence comparison, assembly, annotation and other computational methods to study molecular structures, functions, and diseases.
Integrative analysis of transcriptomics and proteomics data with ArrayMining ... (Natalio Krasnogor)
These slides are part of a presentation I gave on March 2010 at the BioInformatics and Genome Research Open Club at the Weizmann Institute of Science, Israel.
In these slides my student and I describe two web-applications for microarray and gene/protein set analysis,
ArrayMining.net and TopoGSA. These use ensemble and consensus methods as well as the
possibility of modular combinations of different analysis techniques for an integrative view of
(microarray-based) gene sets, interlinking transcriptomics with proteomics data sources. This integrative process uses tools from different fields, e.g. statistics, optimisation and network
topological studies. As an example for these integrative techniques, we use a microarray
consensus-clustering approach based on Simulated Annealing, which is part of the ArrayMining.net
Class Discovery Analysis module, and show how this approach can be combined in a modular
fashion with a prior gene set analysis. The results reveal that improved cluster validity indices can be obtained by merging the two methods, and provide pointers to distinct sub-classes within pre-defined tumour categories for a breast cancer dataset by the Nottingham Queens Medical Centre.
In the second part of the talk, I show how results from a supervised
microarray feature selection analysis on ArrayMining.net can be investigated in further detail with
TopoGSA, a new web-tool for network topological analysis of gene/protein sets mapped on a
comprehensive human protein-protein interaction network. I discuss results from a TopoGSA
analysis of the complete set of genes currently known to be mutated in cancer.
Single-cell RNA sequencing (scRNA-seq) allows researchers to analyze gene expression at the individual cell level, exposing heterogeneity that is hidden in bulk tissue analysis. There are various platforms for scRNA-seq that differ in throughput and customizability. Experimental design considerations include the number of cells to sequence, desired sequencing depth, and controlling for batch effects. The analysis workflow generally involves processing and filtering data, normalization, clustering, differential expression analysis, and trajectory inference to reconstruct cellular responses.
This document provides an overview of bioinformatics. It defines bioinformatics as the science of collecting, analyzing and conceptualizing biological data through computational techniques. It discusses that bioinformatics involves managing, organizing and processing biological information from databases, as well as analyzing, visualizing and sharing biological data over the internet. It also outlines some of the goals of bioinformatics like organizing the human and mouse genomes, as well as some applications like genomic and protein sequence analysis, protein structure prediction, and characterizing genomes.
ASHG 2015 - Redundant Annotations in Tertiary Analysis (James Warren)
After obtaining genetic variants from next generation sequencing data, a precursory step in tertiary analysis is to annotate each variant with available relevant information. There is no standardized compendium for this purpose; researchers instead are required to compile data from a motley of annotation tools and public datasets. These sources for annotation are independently maintained, and accordingly there is limited concordance between their reported contents. The choice of annotation datasets thus has a direct and significant impact on the results of the analysis.
The document discusses ongoing efforts to develop more comprehensive human genome variant detection benchmarks, even as sequencing technologies continue advancing. It summarizes:
1) The Genome in a Bottle Consortium's work characterizing increasingly challenging variants and regions for benchmarking, including seven human genomes as reference materials.
2) Current efforts to benchmark variants in tandem repeats and develop new benchmarks based on complete diploid genome assemblies.
3) Planned expansions of the benchmarks to include additional genomes, variant types like mosaic variants, and integration with other omics data like RNA sequencing and methylation.
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales. (jennomics)
Presentation at a workshop conducted by the UC Davis Bioinformatics Core Facility: Using the Linux Command Line for Analysis of High Throughput Sequence Data, September 15-19, 2014
This document discusses various methods for predicting genes and analyzing unknown DNA sequences, including:
- Using profiles, patterns, and hidden Markov models (HMMs) to find conserved sequences and predict protein function
- Ontologies like Gene Ontology that organize genes and gene products in a structured network to facilitate annotation and analysis
- Computational tools like Genefinder and Glimmer that use signals like coding potential, open reading frames, start/stop codons, and sequence similarity to known genes to predict gene structures in sequences
- Integrating multiple lines of evidence, like HMMs, EST alignments, repeats, and CpG islands, can improve gene prediction over a single method.
The document discusses building a global map of human gene expression by integrating data from thousands of gene expression experiments deposited in public databases. It describes two approaches: 1) Integrating over 9,000 samples on a quantitative level to identify major expression classes and differentially expressed genes. 2) A meta-analysis approach to identify condition-specific differentially expressed genes across many studies. Combining both approaches could provide a comprehensive global map of human gene expression.
This document discusses using hidden Markov models (HMMs) for gene prediction and analysis of unknown DNA sequences. It explains that HMMs allow for probabilistic modeling of sequences that accounts for insertions and deletions, and can be used to identify coding regions, splice sites, repeats and other features in genomic sequences. The document provides examples of using HMMs to represent proteins and DNA as probabilistic state machines, and describes how HMMs can incorporate profile data to enable database searching and gene prediction.
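The idea of DNA as a probabilistic state machine can be illustrated with a toy two-state HMM decoded by the Viterbi algorithm. Everything numeric here is an assumption made for illustration: a GC-rich "C" (coding-like) state, an AT-rich "N" (non-coding-like) state, and invented start, transition and emission probabilities. Real gene finders use far richer models with codon, splice-site and length states.

```python
import math

# Toy model parameters, invented for illustration only.
STATES = ("C", "N")
START_P = {"C": 0.5, "N": 0.5}
TRANS_P = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT_P = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # GC-rich state
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # AT-rich state
}


def viterbi(obs, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Most probable state path for obs, computed in log space."""
    if not obs:
        return []
    # Log-probability of the best path ending in each state at position 0.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][symbol])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print("".join(viterbi("AATTGCGCGC")))
```

With these parameters the decoder labels the AT-rich prefix as "N" and switches to "C" where the GC-rich run begins, since one state change costs less than emitting six unlikely bases.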
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured (Jonathan Eisen)
This paper describes the results of applying whole-genome shotgun sequencing to microbial populations collected from the Sargasso Sea near Bermuda. The authors sequenced over 1 billion base pairs from environmental samples and assembled these sequences into 1,045 large fragments. Analysis of the sequences revealed an unexpected diversity of microbial species, including many novel lineages. This pioneering study demonstrated the power of metagenomic sequencing to reveal the vast diversity of microbial life in the oceans.
The document discusses various applications and techniques of DNA microarrays, including summarizing key points about Affymetrix GeneChips, spotted microarrays, experimental design, data analysis, and several case studies on various topics like ovarian cancer, Sjogren's syndrome, wine yeast genomics, and norovirus genotyping. Microarrays allow analysis of gene expression patterns and copy number variations across genomes through comparative hybridization experiments. The document provides an overview of microarray technology and applications in genomic and biomedical research.
This study developed a computational approach combining gene expression ranking and co-expression network analysis to identify novel stress regulatory genes in Arabidopsis thaliana. The researchers ranked genes based on their expression response to multiple abiotic stresses and analyzed co-expression networks to select candidate stress regulatory genes. Screening 62 mutants defective in candidate genes yielded a remarkably high gene discovery rate of up to 62% for stress phenotypes, far greater than typical screens. Additionally screening 64 other mutants based only on expression ranking yielded a lower but still improved rate of 36%, showing the power of combining methods. This systems approach can enhance gene discovery for other processes given suitable transcriptome data.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
Describing and Interpreting an Immersive Learning Case with the Immersion Cub... (Leonel Morgado)
Current descriptions of immersive learning cases are often difficult or impossible to compare. This is due to a myriad of different options on what details to include, which aspects are relevant, and on the descriptive approaches employed. Also, these aspects often combine very specific details with more general guidelines or indicate intents and rationales without clarifying their implementation. In this paper we provide a method to describe immersive learning cases that is structured to enable comparisons, yet flexible enough to allow researchers and practitioners to decide which aspects to include. This method leverages a taxonomy that classifies educational aspects at three levels (uses, practices, and strategies) and then utilizes two frameworks, the Immersive Learning Brain and the Immersion Cube, to enable a structured description and interpretation of immersive learning cases. The method is then demonstrated on a published immersive learning case on training for wind turbine maintenance using virtual reality. Applying the method results in a structured artifact, the Immersive Learning Case Sheet, that tags the case with its proximal uses, practices, and strategies, and refines the free text case description to ensure that matching details are included. This contribution is thus a case description method in support of future comparative research of immersive learning cases. We then discuss how the resulting description and interpretation can be leveraged to change immersion learning cases, by enriching them (considering low-effort changes or additions) or innovating (exploring more challenging avenues of transformation). The method holds significant promise to support better-grounded research in immersive learning.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste... (Sérgio Sacani)
Context. With a mass exceeding several 10⁴ M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10⁻⁸ photons cm⁻² s⁻¹. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
The debris of the ‘last major merger’ is dynamically young (Sérgio Sacani)
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige... (University of Maribor)
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
The technology uses reclaimed CO₂ as the dyeing medium in a closed loop process. When pressurized, CO₂ becomes supercritical (SC-CO₂). In this state CO₂ has a very high solvent power, allowing the dye to dissolve easily.
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati... (AbdullaAlAsif1)
The pygmy halfbeak, Dermogenys colletei (Meisner, 2001), is a viviparous fish indigenous to Sarawak, Borneo. Its relatively low fecundity raises questions about the compensatory reproductive strategies the species may employ. Our study examines fecundity and the Gonadosomatic Index (GSI) in D. colletei. We hypothesize that the species exhibits unique reproductive adaptations that offset its low fecundity and thereby enhance its survival and fitness. To test this, we measured fecundity and GSI in 28 mature female specimens. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei that enable its adaptation and persistence in Borneo's diverse aquatic ecosystems, and they call for further ecological research to elucidate these mechanisms. This study contributes to a better understanding of viviparous fish in Borneo and to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
ESR spectroscopy in liquid food and beverages.pptx (PRIYANKA PATEL)
As the population grows, people increasingly rely on packaged foodstuffs, and packaged food has to be preserved. Irradiation is one of several treatments used to preserve food. It is a common and harmless method of food preservation because it does not alter the essential micronutrients of food materials. Although irradiated food does not harm human health, its quality still needs to be assessed so that consumers have the necessary information about the food. ESR spectroscopy is a sophisticated way to investigate food quality and the free radicals induced during food processing. The ESR spin trapping technique is useful for detecting highly unstable radicals in food. The antioxidant capability of liquid food and beverages is mainly assessed with the spin trapping technique.
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
ESPP presentation to EU Waste Water Network, 4th June 2024 “EU policies driving nutrient removal and recycling
and the revised UWWTD (Urban Waste Water Treatment Directive)”
Phenomics assisted breeding in crop improvement (IshaGoswami9)
The world population is increasing and is projected to reach about 9 billion by 2050, and climate change makes it harder to meet the food requirements of such a large population. Facing the challenges presented by resource shortages, climate
change, and increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional
genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding
the complex characteristics of multiple genes, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data that can
be linked to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus,
high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes
during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology,
and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
4. Genome Sequencing
Step by step…
• Library construction and sequencing
• Base-calling: Quality Control
• Assembly (repeat as necessary)
• Annotation (repeat as necessary)
• Publish!
7. Genome Size Variability
• Extent of gene duplication
• Repetitive DNA
• Gene size
– number and length of introns
• Space between genes (% coding)
– regulatory regions
– heterochromatin
9. How do genes get names?
How to find genes in genomes
Problems and strategies in genome annotation
Databases that are useful for annotation
How are genes related to other genes?
RAST
12. Genome fragment of Nitrosocaldus yellowstonii
How does a gene get a name?
ORF = CDS = gene? http://www.genenames.org
13. Automated Annotation Pipelines/Servers
• Provide fast analysis of genomic sequences
  o gene identification & function prediction
• Used to rely on information in public databases (beware!)
• Now often based on re-analysis of published genomes
• Rely on “curated” reference genomes
like Prokka
14. Box 2 | Gene prediction versus gene annotation
Although the terms ‘gene prediction’ and ‘gene annotation’ are often used as if they are synonyms, they are not. With a
few exceptions, gene predictors find the single most likely coding sequence (CDS) of a gene and do not report untranslated
regions (UTRs) or alternatively spliced variants. Gene prediction is therefore a somewhat misleading term. A more
accurate description might be ‘canonical CDS prediction’.
[Figure, Nature Reviews | Genetics: genome tracks along a coordinate axis from 229,500 to 226,500 bp, showing the gene annotation resulting from synthesizing all available evidence (two alternative splice forms, with 5′UTR and 3′UTR), protein evidence (BLASTX), mRNA or EST evidence (Exonerate), and a gene prediction (SNAP) running from start codon to stop codon.]
More types of data help annotation
15. Nature Reviews | Genetics
[Flowchart. Axis labels: increasing time and effort; increasing use of evidence; increasing accuracy. Box labels recoverable from the figure:
• Option 1: predict. Run single ab initio gene predictor; most likely CDS model for each gene.
• Option 2: predict and choose. Run battery of ab initio gene predictors; consensus-based chooser; best consensus CDS model for each gene.
• Option 3: full-scale annotation pipelines. Align ESTs, proteins and RNA-seq data to genome; run battery gene predictors in evidence-driven mode; consensus-based chooser or evidence-based chooser; best consensus CDS model for each gene; post process gene predictions to add UTRs and alternatively spliced transcripts based on evidence; best consensus mRNA model(s) for each gene; mRNA model(s) for each gene most consistent with evidence.
• Optional manual curation using genome browser; manually curated gene models.]
Figure 2 | Three basic approaches to genome annotation and some common variations. Approaches are
compared on the basis of relative time, effort and the degree to which they rely on external evidence, as opposed to
17. 17
ugh to ensure acceptance of
9, 10]. There has also been a
neration techniques such as
g experimental methods
n of a protein’s role and
. These annotations would
se they are based on actual
than homology. Currently
idence tags stating how the
, however, they are often
s. Including evidence quali-
dea of the reliability of the
concept of assigning a level
is not novel, but is seldom
ome of the current steps for
otation and offers a guide to
oblems that are encountered
tion. It goes on to identify
ce genomes and why choos-
not always the best option.
of the public sequence data-
st possible next steps toward
rehensive annotation with
errors.
erial genomes Figure 1: A generic process for bacterial genome
Richardson and Watson
Steps in Genome Annotation
18. Identification of protein-coding regions
Intrinsic evidence
• Absence of stop codons (TAA,TGA,TAG)
• Sufficient open-reading frame (ORF) length (~100 a.a.)
• Presence of start codon (ATG, GTG, TTG)
• Minimize gene overlap
• Presence of other sequence motifs (TATA, RBS, splice sites, polyA)
Extrinsic evidence
• Similarity to “known” genes from other organisms (HOMOLOGY)
• Expression data (mRNA sequencing, proteomics)
• Predicted sequence analysis (e.g., protein structure modeling)
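The intrinsic criteria above (a start codon, no in-frame stop codon, a minimum length) can be sketched as a naive six-frame ORF scan. This is only an illustration under simplifying assumptions: the function names `find_orfs` and `reverse_complement` are hypothetical, coordinates are reported on the scanned strand, and real gene finders such as Glimmer use statistical models rather than this simple rule.

```python
# Naive six-frame ORF scan using only the intrinsic evidence from the slide.
# All names here are illustrative, not taken from any annotation tool.
STOP_CODONS = {"TAA", "TGA", "TAG"}
START_CODONS = {"ATG", "GTG", "TTG"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_aa=100):
    """Return (strand, frame, start, end, length_aa) for each candidate ORF.

    start and end are 0-based offsets on the scanned strand; end includes the
    stop codon. length_aa counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif codon in STOP_CODONS:
                    if start is not None and (i - start) // 3 >= min_aa:
                        orfs.append((strand, frame, start, i + 3, (i - start) // 3))
                    start = None
    return orfs


# Toy sequence: ATG + five GCT codons + TAA, offset by two bases.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_aa=5))
```

With the default threshold of about 100 amino acids the toy sequence yields no ORF, which shows how the length criterion trades sensitivity to short genes for fewer false positives.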
19.
20. 20
What are several ways that could explain sequence similarity
between molecular sequences?
What are potential pitfalls with assigning homology?
How do we generally assign homology?
the Mycoplasma genitalium genome1 (Fig. 1). Where two groups’ descriptions are completely incompatible, at least one must be in error. In my analysis, there is no penalty [...]sions – a likely occurrence because all relied on similar methods and data. This evaluation also ignores minor disagreements in annotation, and disparities in degree of specificity (possibly indicating problematic overprediction of function4). Therefore, the true error rate must be greater than these figures indicate.

There are several possible reasons why the functional analyses have mistakes, as described at greater length elsewhere5–8. For example, it may be that the similarity between the genomic query and database sequence is insufficient to reliably detect homology, an issue solvable by appropriate use of modern and accurate sequence comparison procedures9,10. A more difficult problem is accurate inference of function from homology. Typical database searching methods are valuable for finding evolutionarily related proteins, but if there are only about 1000 major superfamilies in nature11,12, then most homologs must have different molecular and cellular functions.

The annotation problem escalates dramatically beyond the single genome, for genes with incorrect functions are entered into public databases8. Subsequent searches against these databases then cause errors to propagate to future functional assignments. The procedure need cycle only a few times without corrections before the resources that made computational function determination possible – the annotation databases – are so polluted as to be almost useless. To prevent errors from spreading out of control, database curation by the scientific community will be essential4,13.

To ensure that databases are kept usable, the intent of a gene annotation should be clear: does it indicate homolog, ortholog, and/or functional equivalence? Fortunately, some databases already incorporate this information explicitly (e.g. Ref. 14). Errors will, of course, still creep in. To he
FIGURE 1. Comparison of annotations
Three dots represent (left to right) Frasier et al.1, Koonin et al.2 and Ouzounis et al.3 annotations for each of the 468 M. genitalium genes. (Tentative cases
[Figure: dot plot over the 468 M. genitalium genes, axis labelled 001 to 451 in steps of 50. Legend: black circle = no annotation; colored circle = different; blue circle = same annotation.]
TIG April 1999, volume 15, No. 4 13
atory of Molecular Biology, Hills Road,
UK. M. Levitt, C. Chothia, B. Al-Lazikani
provided stimulating discussion.
No. groups annotating gene | No. genes | Annotations per group: Frasier et al.1 | Koonin et al.2 | Ouzounis et al.3 | Total No. annotations | No. conflicts
0 | 33 | – | – | – | – | N/A
1b | 95 | 14 | 15 | 66 | 95 | N/A
2 | 318 | 279 | 317 | 40 | 636 | 45
3 | 22 | 22 | 22 | 22 | 66 | 10
Sum (2+3) | 340 | 301 | 339 | 62 | 702 | 55
Summary of annotations made by each group (Fig. 1), minimal number of conflicting annotations (s
the resulting minimal fraction of annotations that are erroneous.
a Frasier et al.1 data from http://www.tigr.org/tdb/mdb/mgdb/mgdb.html. Koonin et al.2 data from ht
nlm.nih.gov/Complete_Genomes/Mgen. Ouzounis et al.3 data from http://www.embl-heidelberg
mycogen.new.html. Instances where Ouzounis et al.3 reported SWISS-PROT annotation of the same gene w
avoid duplication with Frasier et al.1 entries. However, even if all of these 300 annotations are included
annotation error rate drops only to 6%. All annotations were collected in 1996, shortly after the genom
b No comparative analysis is possible when only one group made an annotation.
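The "minimal fraction of annotations that are erroneous" can be recomputed from the table as a small worked example. The sketch below assumes, following the text's argument that at least one annotation in each conflict must be wrong, that the lower bound is simply conflicts divided by total annotations; the function name `minimal_error_rate` is hypothetical.

```python
# Values copied from the summary table: (total annotations, conflicts) for
# genes annotated by two groups and by three groups.
TABLE = {
    "2 groups": (636, 45),
    "3 groups": (66, 10),
}


def minimal_error_rate(annotations, conflicts):
    """Lower bound on the error rate, assuming one wrong annotation per conflict."""
    return conflicts / annotations


total_annotations = sum(a for a, _ in TABLE.values())  # 702
total_conflicts = sum(c for _, c in TABLE.values())    # 55
overall = minimal_error_rate(total_annotations, total_conflicts)

for label, (annotations, conflicts) in TABLE.items():
    print(f"{label}: {minimal_error_rate(annotations, conflicts):.1%}")
print(f"overall: {overall:.1%}")
```

Under this assumption the overall lower bound comes out just under 8%. The same rule applied to a single gene with two wholly inconsistent annotations gives 1/2, matching the 50% quoted for mg302 in Figure 2.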
al. (1995) The minimal gene complement of Mycoplasma
nce 270, 397–403
l. (1996) Sequencing and analysis of bacterial genomes.
4–416
al. (1996) Novelties from the complete genome of Mycoplasma
Microbiol. 20, 898–900
(1998) Protein annotation: detective work for function prediction.
, 248–250
nd Koonin, E.V. (1998) Sources of systematic error in functional
nomes: domain rearrangement, non-orthologous gene
nd operon disruption. In Silico Biol. 1, 7
Zhang, X. (1997) The challenges of genome sequence annotation or
he details’. Nat. Biotechnol. 15, 1222–1223
998) Predicting function: from genes to genomes and back.
, 707–725
roch, A. (1996) Go hunting in sequence databases but watch out for
s Genet. 12, 425–427
al. (1998) Assessing sequence
hods with reliable structurally identified distant evolutionary
oc. Natl. Acad. Sci. U. S. A.
al. (1994) Issues in searching molecular sequence databases.
19–129
11 Chothia, C. (1992) Proteins. One thousand families for the molecular biologist.
Nature 357, 543–544
12 Brenner, S.E. et al. (1997) Population statistics of protein structures: lessons from
structural classifications. Curr. Opin. Struct. Biol. 7, 369–376
13 Smith, T.F. (1998) Functional genomics – bioinformatics is ready for the challenge.
Trends Genet. 14, 291–329
14 Tatusov, R.L. et al. (1997) A genomic perspective on protein families. Science 278,
631–637
23. COMMENT: Errors in genome annotation
FIGURE 2. Example annotations and analysis
(a) Consistent annotations. Annotations were generally considered consistent for this analysis if either the function or the gene name match (e.g. mg463; mg010). An exception is when one group uses a gene name and another specifically notes that the current gene is a paralog and not identical (consider mg010). Where the descriptions from different groups were compatible, but of different levels of specificity, this was considered a correct assignment (e.g. mg225). The difficulty of reconciling pairs of descriptions to determine whether they reflect compatible functions makes this analysis imprecise. Generally, the approach here is generous and should err on the side of detecting too few errors; it is usually more permissive than Ref. 5. mg463: Frasier et al.1 and Koonin et al.2 describe different aspects of function, but give the same gene name. The Ouzounis et al.3 description is compatible with that from Koonin et al.2, but less specific. All three annotations are considered correct for this analysis. mg010: Frasier et al.1 and Ouzounis et al.3 agree that this is a DNA primase. Koonin et al.2 use a different gene name and explicitly state that this is a truncated protein. Because of the common functional descriptions, all three are considered correct. However, if Koonin et al.2 had been more explicit in indicating a functional difference, then their annotation would have been marked as conflicting. (Note that mg250 is also annotated as a DNA primase by all three groups.) mg225: the Ouzounis et al.3 annotation of histidine permease is more specific than the Koonin et al.2 description of amino acid permease. It may be that histidine permease is an (incorrect) overprediction of function, or it could be correct. The two annotations are considered consistent, and the decision of Frasier et al.1 not to provide a function is not penalized.
(b) Inconsistent annotations. mg302: lack of a functional assignment from Frasier et al.1 is not penalized. The Koonin et al.2 and Ouzounis et al.3 annotations are wholly inconsistent. This leads to a conflict and a minimum error rate of 50%. Note that the assessment
(a)
mg463
Frasier et al. High level kasugamycin resistance (ksgA)
Koonin et al. rRNA (adenosine-N6, N6-)-dimethyltransferase (ksgA)
Ouzounis et al. Dimethyladenosine transfe [sic]
mg010
Frasier et al. DNA primase (dnaE)
Koonin et al. DNA primase (truncated version) (DnaGp)
Ouzounis et al. DNA primase (EC 2.7.7.-)
mg225
Frasier et al. Hypothetical protein
Koonin et al. Amino acid permease
Ouzounis et al. Histidine permease
(b)
mg302
Frasier et al. No database match
Koonin et al. (Glycerol-3-phosphate?) permease
Ouzounis et al. Mitochondrial 60S ribosomal protein L2
mg448
Frasier et al. Pilin repressor (pilB)
Koonin et al. Putative chaperone-like protein
Ouzounis et al. PilB protein
mg085
Frasier et al. Hydroxymethylglutaryl-CoA reductase (NADPH)
Koonin et al. ATP(GTP?)-utilizing enzyme
Ouzounis et al. NADH-ubiquinone oxidoredu [sic]
Two kinds of problems
insufficient similarity to assume homology
inference of function from homology
24.
Table 1 Statistics for different annotations for H. utahensis genome along with the extended annotations. For orphan and functional genes
genes and the percentage relative to the total number of annotated genes
(For each pipeline: the original annotation, then the annotation complemented by annotation of function from the other two pipelines. Cells in the last column are cut off in the source.)
Annotation features | NCBI original | NCBI complemented (AAMG and RAST) | AAMG original | AAMG complemented (NCBI and RAST) | RAST original | RAST complemented (NCBI and AAMG) | Extended annotation (EA)
CDS | 2998 | 2998 | 3040 | 3040 | 3041 | 3041 | 2980
rRNA | 4 | 4 | 3 | 3 | 3 | 3 | 4
tRNA | 45 | 45 | 45 | 45 | 45 | 45 | 45
ncRNA | 1 | 1 | 0 | 0 | 0 | 0 | 0
frameshift/Pseudo | 0 | 0 | 0 | 0 | 0 | 0 | 0
Total | 3048 | 3048 | 3088 | 3088 | 3089 | 3089 | 3029
Orphan genes | 1014 (33.27 %) | 777 (25.49 %) | 885 (28.66 %) | 837 (27.10 %) | 1203 (38.94 %) | 819 (26.51 %) | 672 (22
Functional genes | 2034 (66.73 %) | 2271 (74.51 %) | 2203 (71.34 %) | 2251 (72.90 %) | 1886 (61.06 %) | 2270 (73.49 %) | 2357 (7
Another issue with annotation
Not all proteins have homologs in
another genome — check out
Giardia
25. Reflecting annotation uncertainty in
gene names
• “Domain”-containing protein
Predicted protein contains a region similar to a recognized
protein domain or fold
– ankyrin-repeat domain containing protein
• Conserved hypothetical protein
Predicted protein is homologous to predicted proteins in at least
one other (distinct!) organism
• Hypothetical protein
Nothing is known about the predicted protein (no known
homologs)
Avoid “-like” in names: homology is a yes/no property
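The naming convention on this slide can be sketched as a small decision rule. This is a hypothetical helper, not the behaviour of any annotation pipeline: the function name `product_name`, its parameters, and the choice to let a domain hit take priority over homology are all assumptions made for illustration.

```python
def product_name(domain=None, has_homolog_elsewhere=False):
    """Pick a cautious product name from the available evidence.

    domain: name of a recognized domain or fold found in the predicted
        protein (e.g. "ankyrin-repeat"), or None if no domain was found.
    has_homolog_elsewhere: True if the predicted protein is homologous to a
        predicted protein in at least one other (distinct) organism.
    """
    if domain:
        return f"{domain} domain-containing protein"
    if has_homolog_elsewhere:
        return "conserved hypothetical protein"
    return "hypothetical protein"


print(product_name(domain="ankyrin-repeat"))
print(product_name(has_homolog_elsewhere=True))
print(product_name())
```

None of the three outcomes asserts a function, which is the point of the convention: the name records how much is actually known.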
26. Dangers of Serial Annotation
• Function is generally “inferred” from homology
• Poor annotations are propagated in the public sequence databases
(GenBank) - think the Telephone Game
• Failure to examine functional assignments leads to deterioration of data
and errors
• Manual curation is needed to validate annotation and add valuable
information
• Particularly important for representatives of new lineages
– often homologous genes in new lineages are very different from those
in other organisms
– need good annotation of “anchor” genomes for subsequent
sequencing
30. [Figure, from a RESEARCH ARTICLE: phylogenetic tree of actin and related sequences. Labelled clades: crenactin, Arp1, Arp2 and Arp3, plus clades of LCGC14AMP and Lokiarchaeum sequences (e.g. LCGC14AMP_05736710, LCGC14AMP_06532160). Also labelled: Nitrosopumilus maritimus SCM1. Numbers on branches range from 51 to 100; scale bar 0.4.]
What if you found eukaryotic genes in an archaeon?
31.
32. 32
How do genes evolve?
Speciation —> Diversification
Gene duplication —> Diversification
33. Almost half of the genes in any
genome are in gene families
and are deleted from the genome. The rate of duplication
that gives rise to stably maintained genes is the birth rate
multiplied by the retention rate, which is expected to
fluctuate with gene function, among other things.
Duplicated genes are often referred to as paralogous
genes, which form gene families. Several authors have
tabulated the distribution of gene family size for a few
completely sequenced genomes [11,12] and this varies
substantially among species and gene families [13]; for
instance, the biggest gene family in D. melanogaster is the
Table 1. Prevalence of gene duplication in all three domains of life (a)
Organism | Total number of genes | Number of duplicate genes (% of duplicate genes) | Refs
Bacteria
Mycoplasma pneumoniae | 677 | 298 (44) | [65]
Helicobacter pylori | 1590 | 266 (17) | [66]
Haemophilus influenzae | 1709 | 284 (17) | [67]
Archaea
Archaeoglobus fulgidus | 2436 | 719 (30) | [68]
Eukarya
Saccharomyces cerevisiae | 6241 | 1858 (30) | [67]
Caenorhabditis elegans | 18 424 | 8971 (49) | [67]
Drosophila melanogaster | 13 601 | 5536 (41) | [67]
Arabidopsis thaliana | 25 498 | 16 574 (65) | [69]
Homo sapiens | 40 580 (b) | 15 343 (38) | [11]
a Use of different computational methods or criteria results in slightly different estimates of the number of duplicated genes [12].
b The most recent estimate is ~30 000 [61].
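As a quick check on the slide title, the percentages in Table 1 can be recomputed from the gene counts. The snippet below simply copies the counts from the table; the dictionary and function names are illustrative.

```python
# (total genes, duplicate genes, percentage reported in Table 1)
GENOMES = {
    "Mycoplasma pneumoniae": (677, 298, 44),
    "Helicobacter pylori": (1590, 266, 17),
    "Haemophilus influenzae": (1709, 284, 17),
    "Archaeoglobus fulgidus": (2436, 719, 30),
    "Saccharomyces cerevisiae": (6241, 1858, 30),
    "Caenorhabditis elegans": (18424, 8971, 49),
    "Drosophila melanogaster": (13601, 5536, 41),
    "Arabidopsis thaliana": (25498, 16574, 65),
    "Homo sapiens": (40580, 15343, 38),
}


def duplicate_percentage(total, duplicates):
    """Share of genes that have at least one duplicate, in percent."""
    return 100 * duplicates / total


for organism, (total, duplicates, reported) in GENOMES.items():
    pct = duplicate_percentage(total, duplicates)
    print(f"{organism}: {pct:.1f}% (table: {reported}%)")
```

The recomputed values agree with the table after rounding and span roughly 17% to 65%, so "almost half" describes some genomes in the table better than others.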
34. Paralogous Gene Families
- many genes in the genome are present in “families” and
each gene in a gene family shares a common ancestry
(homologs)
- gene families arise from duplication and subsequent
diversification by various mechanisms
how are these copies different from alleles?
Possible fates of duplicated genes:
[Diagram labels: ancestral, duplicated, full, sub, new, dead]
35. Evolutionary fates of duplicated genes
pseudogenization - copy becomes non-functional when it accumulates a stop
codon.
this gene is eventually lost from the genome, but young
pseudogenes would still be recognizable as homologs. why?
conservation of function - extra copy could provide
greater amounts of protein. why?
36. 36
subfunctionalization - extra copy could have a new
function (or a sub-function). why?
– most proteins have > 1 function (could be expressed
differently in different parts of cell/tissue or at different
times)
– if greater amounts of a protein not advantageous, extra
copy would be selected against unless…
– subfunctionalization - both copies adopt some functions
of parent gene (moonlighting functions)
– sometimes this can be differential gene expression in
different tissues
37. 37
Neofunctionalization - extra copy could have a
novel function. why?
–often a related function (not entirely new)
–opsin gene family is a good example
–this could require a lot of mutations in new gene copy
38. Orthologs and paralogs
a A*b* c BC*
Ancestral gene
Duplication to give 2
copies = paralogs on the
same genome
orthologous orthologous
paralogous
A*C*b*
A mixture of orthologs
and paralogs sampled
potential problem
39. Orthologs: Homologs inherited after speciation.
Gene phylogeny may match organismal phylogeny.
Paralogs: Homologs produced by gene duplication.
Evidence for paralogy: multiple homologs in a given species, or a
phylogenetic analysis showing that gene duplication was involved,
i.e. the gene phylogeny does not match the organismal phylogeny
in a tree where most genes do match the organismal
phylogeny well.
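The two definitions can be made concrete on a gene tree whose internal nodes are already labelled as speciation or duplication events. The sketch below assumes those labels are given (in practice they come from reconciling the gene tree with the species tree, which is not shown here); the tuple encoding, the gene names and the function names are all hypothetical.

```python
# A gene tree as nested tuples: (event, left_subtree, right_subtree).
# Leaves are gene names; event is "speciation" or "duplication".


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def relationship(node, a, b):
    """Classify genes a and b by the event at their last common ancestor."""
    if isinstance(node, str):
        raise ValueError("a and b must be two distinct leaves of the tree")
    event, left, right = node
    for child in (left, right):
        child_leaves = leaves(child)
        if a in child_leaves and b in child_leaves:
            return relationship(child, a, b)  # common ancestor lies deeper
    if not {a, b} <= leaves(node):
        raise ValueError("a and b must be two distinct leaves of the tree")
    return "orthologs" if event == "speciation" else "paralogs"


# An ancestral gene duplicates into copies A and B, then species 1 and 2 split.
tree = (
    "duplication",
    ("speciation", "sp1_A", "sp2_A"),
    ("speciation", "sp1_B", "sp2_B"),
)
print(relationship(tree, "sp1_A", "sp2_A"))  # orthologs
print(relationship(tree, "sp1_A", "sp1_B"))  # paralogs
print(relationship(tree, "sp1_A", "sp2_B"))  # paralogs
```

The last call is the "potential problem" case: sampling copy A from one species and copy B from another compares paralogs, so the resulting gene tree need not match the organismal phylogeny.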
40. Using phylogeny to check for
paralogs (or orthologs)
–multiple copies of a gene in a genome
–look at which clades contain the paralogs
–duplication events can occur > once
–can be paralog loss/gain
41. [Figure: the actin tree from slide 30 shown again, with the actin, ARP and crenactin groups labelled. A second, partly cropped tree shows Lokiarchaeum sequences with Arf-family, Sar1-family and Rab-family clades; taxon labels include Arabidopsis thaliana, Thalassiosira pseudonana, Methanopyrus kandleri AV19, Pyrobaculum aerophilum IM2, Aciduliprofundum boonei T469, Korarchaeum cryptofilum OPF8, Caldiarchaeum subterraneum, Myxococcus xanthus DK1622 and Nitrosopumilus maritimus SCM1.]
Actins are part of a gene family
ARP = actin related protein
42. 42
from sequence-based homology searches
(Fig. 1). Despite this variance, two features
are preserved between prokaryotic and
eukaryotic actins. The first common feature
in multistrand filament architectures. This
maintenance of contacts within a strand
suggests that the primordial actin filament
was single-stranded. In PNAS, Braun et al.
packing (6, 7). L
crographs (EMs)
indicated a struc
either single- or
(6). Now, Braun
in an 18-Å cry
crenactin can for
in vitro.
In determinin
sents a record o
actin filament,
functions must
appears to inte
proteins, the arca
as a cell shape-d
ment has some
actin homologs M
bulin homolog F
a dedicated cell s
idence from bact
filaments have a
sequences and f
evolved to becom
tion (1). Conseq
why crenactin fo
ment may be tha
mal for its role
Fig. 1. Relatedness of actins. The structures of actin protofilaments (2, 9–15) are shown below a maximum-
likelihood phylogenetic tree of the actin protein sequences. The structures are aligned via the central protomer, Author contributions: U.G
Actins are part of a larger gene family
43. [Figure: the actin tree from slide 30 shown again, with the actin, crenactin and MreB groups labelled. Additional taxon labels: Pyrobaculum aerophilum IM2, Aciduliprofundum boonei T469, Korarchaeum cryptofilum OPF8, Caldiarchaeum subterraneum, Myxococcus xanthus DK1622 and Nitrosopumilus maritimus SCM1.]
Actins are part of a gene family
Actin and Arp 2/3 required for motility
Arps = actin related proteins (and are not actin)
44. [Figure: phylogenetic tree of actins and actin-related proteins (ARPs), with sequences prefixed Sc, Sp, Ce, Dm, Hs, Mm, At, Dd, Nc, Gg, Ac, An, Os, Tg, Gl and Pf. Labelled groups and associated functions:
• Conventional actins
• Metazoan actins
• ARP1: dynein motility (dynactin complex)
• ARP2: actin polymerization (ARP2/3 complex)
• ARP3: actin polymerization (ARP2/3 complex)
• ARP4: chromatin remodeling
• ARP5: chromatin remodeling
• ARP6: nuclear?
• ARP8: chromatin remodeling
• ARP10?: dynein motility (dynactin complex)
Legend fragment: “Confidence estimates:”]
Caption fragments: “center of the tree) but, except in carefully calibrated cases, this relationship is not defined and probably varies between different parts of the tree.” “defining subgroups by the deepest strongly supported node. Modified, with permission, from Ref. [3].”
45. [Figure: the actin tree from slide 30 shown again, with the actin, ARP and crenactin groups labelled.]
How confident are you in the functions
of the Loki actin homologs based on
this tree?
46. Some types of Protein Databases
Database: nr (translated GenBank sequences)
• Advantages: everybody can submit data
• Problems: many errors, because there is no manual inspection; no additional information links; redundant
Database: UniProt (TrEMBL)
• Advantages: non-redundant dataset derived from GenBank, DDBJ and EMBL; links to additional information; GO term annotations
• Problems: many errors, because there is no manual inspection
Database: RefSeq
• Advantages: mostly fully sequenced organisms; data submitted by genome projects; some entries are reviewed
• Problems: fewer links to other databases; fewer sequences than nr and TrEMBL
Database: UniProt (SwissProt)
• Advantages: all entries reviewed; links to additional information
• Problems: fewer sequences than nr, TrEMBL and RefSeq
Annika Joecker
Max-Planck Institute for Plant Breeding Research
47. Sources of Information
Many types of databases are used for genome annotation
49. CDD – Conserved Domain Database
• Contains protein domain models imported from Pfam,
SMART, COG (clusters of orthologous genes), KOG (eukaryotic
COGs)
• Curated and provided at NCBI
• Search tool: RPS-BLAST
• 27036 PSSMs (position-specific scoring matrices) (Dec 2008)
–Count amino acids at each position in multiple alignment
–Compute percentage
–Compute log ratio
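The three PSSM steps (count, percentage, log ratio) can be sketched in a few lines. This is a teaching sketch, not how CDD builds its models: the uniform background frequencies, the pseudocount of 1 per residue and the names `build_pssm` and `score` are assumptions introduced here, and gap characters are simply ignored.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, background=None, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from aligned sequences."""
    if background is None:
        # Assumption: uniform background; real tools use observed frequencies.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for pos in range(length):
        # Step 1: count amino acids at this position (gaps are skipped).
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        column = {}
        for aa, bg in background.items():
            freq = (counts[aa] + pseudocount) / total  # Step 2: percentage
            column[aa] = math.log2(freq / bg)          # Step 3: log ratio
        pssm.append(column)
    return pssm


def score(pssm, seq):
    """Sum the log-odds scores of seq against the matrix, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


pssm = build_pssm(["MKV", "MKL", "MRV"])
print(round(score(pssm, "MKV"), 2), round(score(pssm, "AAA"), 2))
```

Residues that are frequent in a column get positive scores and unseen residues get negative ones, so a sequence resembling the alignment scores higher than an unrelated one.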
50. Protein Domain Search: InterPro
• Database of protein families, domains and functional
sites
• Hosted at the European Bioinformatics Institute (EBI)
• Consortium of member databases (PROSITE, Pfam,
Prints, ProDom, SMART and TIGRFAMs,
Superfamily, Panther)
• Tool for searching: InterProScan
• http://www.ebi.ac.uk/Tools/InterProScan/
51. KEGG – Kyoto Encyclopedia of Genes
and Genomes
• Comprehensive database of biological
information:
–KEGG GENES: genes and proteins
–KEGG LIGAND: endogenous & exogenous
chemical building blocks
–KEGG PATHWAY: biochemical pathways
–KEGG BRITE: KEGG-based ontology
• Web and stand-alone based tools
52. The Gene Ontology
• A way to capture biological knowledge in a written and computable form
• A set of concepts and their relationships to each other arranged as a hierarchy
– Ranges from less specific concepts to more specific concepts
www.ebi.ac.uk/QuickGO
53. Ontologies: The Scope of GO
1. Molecular Function
e.g. protein kinase activity
2. Biological Process
e.g. cell cycle
3. Cellular Component
e.g. mitochondrion
GO terms aim to describe the ‘normal’ functions/ processes/locations that gene
products are involved in
NO: pathological processes, experimental conditions or temporal information
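The "computable hierarchy" idea can be illustrated with a toy fragment of the molecular function ontology. The parent links below are a deliberately simplified assumption for illustration (the real GO graph has intermediate terms and multiple parents), and the names `IS_A` and `ancestors` are hypothetical.

```python
# Toy is_a links, child -> list of parents. Simplified; not a copy of GO.
IS_A = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, is_a=IS_A):
    """Return every less specific term reachable from term via is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


print(sorted(ancestors("protein kinase activity")))
```

Because a gene product annotated with a specific term is implicitly annotated with all of that term's ancestors, a query for "kinase activity" also finds proteins annotated only as "protein kinase activity".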
55. Microbes Online
www.microbesonline.org
• Excellent resource for
microbial genome data
• Precomputed ortholog/
paralog searches
• Aligned protein
sequences for
phylogenetic analysis
• Pathway-based
organization of data
63. How to annotate metagenomic data?
all the ways we’ve discussed before
(think homologs....)
64. phylogenetic “binning”
40.2 Methods for the Phylogenetic Binning of Metagenome Sequence Samples
[Figure 40.1: two charts of taxonomic composition, panels (A) and (B). Taxon labels: root, Bacteria, Archaea, Proteobacteria (Alpha-, Beta-, Gamma-, Delta- and Epsilonproteobacteria), Bacteroidetes, Bacteroidia, Firmicutes, Bacilli, Clostridia, Actinobacteria, Actinobacteria (class), Cyanobacteria, Spirochaetes, Euryarchaeota, Methanomicrobia, Methanobacteria, Thermoprotei.]
Figure 40.1 Comparison of
composition of public database
microbial community analyzed
sequencing. (A) Taxonomic co
finished genomes present in Ge
May 2009: The large bias towa
Gammaproteobacteria is caused
by 164 genome sequences of E
strains. (B) Taxonomic compos
populations in the human gut e
Panel labels: genbank; metagenomic data from human gut
1. homology-based assignment of reads (e.g., BLAST)
2. compositionally based assignment (e.g., %G+C, or nucleotide frequencies: stretches of 2-9 nts)
assigning genomic data to different groups of organisms
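Compositional assignment can be sketched with the two signals named on the slide, %G+C and short nucleotide-word (k-mer) frequencies. This is a minimal illustration, not a real binning tool: the squared Euclidean distance, the toy reference sequences and the function names are assumptions made here, and practical binners work on much longer sequences with more careful statistics.

```python
from collections import Counter
from itertools import product


def gc_fraction(seq):
    """Fraction of A/C/G/T bases that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k=4):
    """Relative frequency of every possible k-mer, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def closest_bin(read, references, k=4):
    """Assign a read to the reference with the most similar k-mer profile."""
    read_vec = kmer_frequencies(read, k)

    def distance(name):
        ref_vec = kmer_frequencies(references[name], k)
        return sum((a - b) ** 2 for a, b in zip(read_vec, ref_vec))

    return min(references, key=distance)


# Toy references standing in for two compositionally distinct genomes.
references = {
    "high_gc": "GCGCGGCCGCGCGGCC" * 4,
    "low_gc": "ATATTAAATATTTAAT" * 4,
}
read = "GCGGCCGCGC"
print(gc_fraction(read), closest_bin(read, references, k=2))
```

Unlike the homology-based route, this approach needs no database hit for the read itself, only a compositional profile for each candidate group.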