This document discusses various methods for annotating genomes after sequencing and assembly. Sequence analysis approaches like identifying open reading frames can rapidly and inexpensively find some genes, but have weaknesses like false positives and missing short genes. More accurate methods are needed to find non-coding RNAs, pseudogenes, and other elements. As sequencing technologies generate more data, the bottleneck has shifted to analysis, requiring skills in both biology and mathematics. The document provides an example sequence to annotate and poses questions about fast, cheap and accurate annotation methods.
This document discusses gene prediction and promoter prediction. It begins by explaining that gene prediction involves locating protein-coding genes within sequenced genomes in order to understand their functional content. Various computational methods are used for gene prediction, including searching for signals like start/stop codons, searching coding content, and comparing sequences to find homologs. Promoter prediction involves locating DNA elements that regulate gene expression and is challenging due to diversity and short, conserved motifs. Ab initio and comparative phylogenetic footprinting methods are used to predict promoters and regulatory elements in prokaryotes and eukaryotes.
Systems biology & Approaches of genomics and proteomics - sonam786
This presentation provides a basic understanding of various genomics and proteomics techniques. Systems biology studies life as a system, including the study of living systems using various omics technologies.
S.Prasanth Kumar is a bioinformatician who studies proteomics, 2D-PAGE, and proteome databases. Proteomics involves the study of proteins expressed by a genome through analysis of protein sequences, structures, modifications, and interactions. Major databases include Swiss-Prot, which contains annotated protein sequences, and TrEMBL, which contains automatically generated sequences. Other databases contain information on protein families and domains, nucleotide sequences, 2D-PAGE gel images, and post-translational modifications.
This document discusses different methods for genome sequencing and assembly, including restriction enzyme fingerprinting, marker sequences, and hybridization assays. It focuses on using marker sequences like sequence-tagged sites (STS), expressed sequence tags (ESTs), untranslated regions (UTRs), and single nucleotide polymorphisms (SNPs) to map genomes. Large-insert cloning vectors like BACs and PACs can be used with restriction enzyme fingerprinting and FPC software to assemble contigs and map genomes at a large scale. Marker sequences provide a dense set of physical markers to build accurate physical maps of genomes.
The document provides an overview of plant genome sequence assembly, including:
1) A brief history of sequencing technologies and their improvements over time, from Sanger sequencing to newer technologies producing longer reads.
2) Key steps in a sequencing project including read processing, filtering, and corrections before assembly into contigs and scaffolds using appropriate software.
3) Factors to consider for experimental design and assembly optimization such as sequencing depth, library types, and software choices depending on the genome and data characteristics.
Automated sequencing of genomes requires automated gene assignment
Includes detection of open reading frames (ORFs)
Identification of the introns and exons
Gene prediction is a very difficult problem in pattern recognition
Coding regions generally do not have conserved sequences
Much progress made with prokaryotic gene prediction
Eukaryotic genes more difficult to predict correctly
This document provides an overview and introduction to RNA-seq analysis using Next Generation Sequencing. It discusses the RNA-seq workflow including mapping reads with TopHat2, transcript assembly with Cufflinks, and differential expression analysis. Key points covered include the advantages of RNA-seq over microarrays, the exponential drop in sequencing costs, mapping strategies for junction reads including TopHat, and running TopHat from the command line.
This document summarizes different computational methods for protein structure prediction, including homology modeling, fold recognition, threading, and ab initio modeling. Homology modeling relies on identifying proteins with similar sequences and known structures. Fold recognition and threading can be used when there are no homologs, to identify proteins with the same overall fold but different sequences. Ab initio modeling uses physics-based modeling and protein fragments to predict structure from sequence alone, and has challenges due to the vast number of possible conformations.
The document discusses transcriptomics and the relationship between transcriptome size and organism complexity. It questions how gene expression contributes to transcriptome size and what new studies reveal about size and complexity. Specifically, it notes that alternative splicing and RNA editing increase transcriptome size and complexity. It also discusses that the human genome is pervasively transcribed, with one stretch of DNA encoding many RNAs, including microRNAs, which control mRNA expression and are involved in development, gene regulation, and diseases like cancer.
The document discusses the UCSC Genome Browser, an online tool for viewing and interacting with genomic data. It allows users to view multiple data sources simultaneously for a genomic region across many organisms. The document covers basic usage, uploading temporary custom tracks, creating permanent track hubs to host data, and sharing views using saved sessions. Track hubs and sessions allow sharing genomic views and custom data without time limits.
Once a genome has been sequenced, the first question that comes to mind is "Where are the genes?". Genome annotation is the process of attaching information to biological sequences. It is an active area of research, and knowing the coding parts of a genome helps scientists considerably with their wet-lab projects.
Computational biology involves using computational techniques like data analysis, modeling and simulation to study biological systems. Bioinformatics specifically develops tools to analyze biological data. Other computational biology fields include computational anatomy, genomics, neuroscience, pharmacology, and evolutionary biology which all apply computational methods to study anatomical structures, genomes, the brain, drug effects, and evolution respectively. Cancer computational biology aims to predict cancer mutations by analyzing large biological datasets.
Introduction to UPGMA software: its history and origin, the basic meaning of UPGMA, the UPGMA algorithm, the steps to perform UPGMA, a diagrammatic representation of the process with an example, and its applications, advantages, disadvantages, and uses.
This document discusses gene identification and genome annotation. It describes how gene finding in eukaryotes is difficult because genes make up a smaller percentage of genomes such as the human genome, and because introns are larger. It covers open reading frames, complications with introns, and the use of six-frame translation to find protein coding sequences. Software tools for structural and functional annotation are outlined, including identifying genes through homology searching and ab initio prediction using hidden Markov models. The accuracy challenges of ab initio prediction are also summarized.
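The six-frame translation mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the summarized slides: it assumes the standard genetic code, an uppercase or lowercase DNA string containing only A, C, G and T, and the function names (`reverse_complement`, `translate`, `six_frame_translation`) are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate three forward frames (+1..+3) and three reverse frames (-1..-3)."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGAAATAG").items():
        print(name, protein)
```

Long stretches without `*` in any of the six outputs are candidate coding regions; in intron-containing eukaryotic genes such stretches are interrupted, which is one reason this approach works better for prokaryotes.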
The Protein Data Bank (PDB) is an open database that archives 3D structural data of biological macromolecules. It was established in 1971 and currently holds over 150,000 structures determined by X-ray crystallography or NMR spectroscopy. The PDB is overseen by the Worldwide Protein Data Bank and freely accessible online. It serves as a key resource for structural biology and many other databases rely on protein structures deposited in the PDB.
Bioinformatics emerged from the marriage of computer science and molecular biology to analyze massive amounts of biological data, like that produced by the Human Genome Project. It uses algorithms and techniques from computer science to solve problems in molecular biology, like comparing genomic sequences to understand evolution. As genomic data exploded publicly, bioinformatics was needed to efficiently store, analyze, and make sense of this information, which has applications in molecular medicine, drug development, agriculture, and more.
Scoring schemes in bioinformatics (blosum) - SumatiHajela
This document discusses scoring schemes in bioinformatics, specifically BLOSUM (BLOcks SUbstitution Matrix). It introduces BLOSUM, describing that it is based on conserved amino acid patterns from multiple sequence alignments. It then explains the BLOSUM-62 matrix and the BLOSUM scoring algorithm. The document contrasts BLOSUM with PAM matrices, noting key differences like BLOSUM being based on direct observations while PAM uses evolutionary modeling. Finally, it outlines the significance of scoring matrices for detecting distant evolutionary relationships between protein sequences.
SAGE- Serial Analysis of Gene Expression - Aashish Patel
Serial Analysis of Gene Expression (SAGE) is a method to quantify gene expression in cells. It involves extracting short sequence tags from mRNA transcripts and concatenating them for efficient sequencing. This allows simultaneous analysis of thousands of transcripts. SAGE provides quantitative gene expression data without prior knowledge of genes and can identify differentially expressed genes between cell types or conditions. While powerful, it requires substantial sequencing and computational analysis of large datasets.
The document describes several key databases within the KEGG resource, including:
- The PATHWAY database containing molecular network maps of metabolic and genetic pathways.
- The BRITE database providing hierarchical classifications of biological systems beyond what is shown in pathways.
- The LIGAND database consisting of chemical compounds, carbohydrates, reactions, and enzyme information.
KEGG aims to comprehensively capture biological knowledge through integrated databases covering genomes, pathways, diseases and drugs.
An open reading frame (ORF) is the part of a reading frame that contains no stop codons, i.e. a continuous run of amino-acid-coding triplet codons.
An ORF starts with a start codon and ends at a stop codon.
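As a minimal sketch of the definition above, the following Python function scans the three forward reading frames for runs that begin at a start codon and end at the first in-frame stop codon. It assumes ATG as the only start codon and the standard stop codons TAA, TAG and TGA; the function name `find_orfs` and the `min_codons` parameter are hypothetical choices for this example, and the reverse strand is not scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the A in ATG, end is the index just past the
    stop codon, and frame is 0, 1 or 2. min_codons counts the start and
    stop codons as part of the ORF length.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open an ORF at the first ATG seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

Raising `min_codons` is the usual way to trade false positives against missed short genes: short random ORFs are common, so a low threshold over-predicts.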
Protein structure prediction methods: homology modelling, fold recognition, threading, and ab initio methods, presented as short, easy-to-follow slides. After a single read you can easily understand the methods for protein structure prediction.
Systems biology is the computational and mathematical modeling of complex biological systems. It is a biology-based interdisciplinary field of study that focuses on complex interactions within biological systems, using a holistic approach (holism instead of the more traditional reductionism) to biological research.
Gene prediction is the process of determining where a coding gene might be in a genomic sequence. A protein-coding sequence must begin with a start codon (where translation begins) and end with a stop codon (where translation ends).
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ... - Torsten Seemann
This document discusses de novo genome assembly, which is the process of reconstructing long genomic sequences from many short sequencing reads without the aid of a reference genome. It is challenging due to factors like short read lengths, repetitive sequences that complicate the assembly graph, and sequencing errors. The goals of assembly are to produce contiguous sequences with high completeness and correctness by resolving overlaps between reads into consensus sequences. Metrics like N50, core gene content, and read remapping are used to assess assembly quality.
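Of the quality metrics named above, N50 is simple enough to show directly. The sketch below uses the common definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` is a hypothetical choice for this example.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

N50 measures contiguity only; an assembler can inflate it by joining contigs incorrectly, which is why it is usually reported alongside completeness and correctness checks such as core gene content and read remapping.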
The Ensembl genome browser is a web-based tool that allows researchers to visualize and analyze genomic data. It was launched in 1999 by the Ensembl project, a joint initiative between EMBL's European Bioinformatics Institute and the Wellcome Sanger Institute. Ensembl contains genome data for humans and many other species, allowing users to browse genes, view their molecular functions, and utilize tools for variant effect prediction, data mining, and more. Key features include separate browsing options for domains like fungi, plants, animals, and bacteria.
Genome annotation, NGS sequence data, decoding sequence information. The genome contains all the biological information required to build and maintain any given living organism.
This document discusses statistical approaches to gene prediction. It begins by outlining the central dogma of biology and the discovery of codons. Later sections discuss key findings such as the discovery of split genes consisting of exons and introns, and the splicing mechanism by which introns are removed from mRNA. The document compares two main approaches to computational gene prediction: statistical methods that analyze sequence patterns in exons versus introns, and similarity-based methods that leverage similarities to known genes from other species.
This document discusses gene identification and discovery. It begins by describing gene identification in prokaryotes, unicellular eukaryotes, and multicellular eukaryotes. It then discusses the components of protein-coding genes and different approaches to gene prediction for prokaryotes and eukaryotes. The document also covers gene structure in prokaryotes and eukaryotes, as well as software tools and methods for gene prediction. Finally, it discusses several approaches to classifying genes by function.
Talk at the Salk Institute's 2012 Systems to Synthesis Symposium. Discusses the use of online games with the purpose of annotating the human genome and building better phenotype predictors.
This document provides an overview of functional genomics and methods for transcriptome analysis. It discusses two main approaches - sequence-based approaches like expressed sequence tags (ESTs) and serial analysis of gene expression (SAGE), and microarray-based approaches. For sequence-based approaches, it describes how ESTs can provide gene discovery and expression information but have limitations. It outlines the SAGE methodology and gene index construction to organize EST data. For microarrays, it summarizes the basic workflow including sample preparation, hybridization, image analysis and data normalization to identify differentially expressed genes through statistical tests.
The document describes an experiment investigating how loss of the CHD1 gene affects position effect variegation (PEV) in Drosophila melanogaster. PEV is when a euchromatic gene like white, which normally expresses, is silenced in some cells when placed near heterochromatin. The experiment found that homozygous loss of the CHD1 gene acted as an enhancer of variegation for white inserted in the telomere but as a suppressor of variegation for the white mottled 4h line. This suggests CHD1 loss has different effects on PEV depending on the chromosomal location of the white gene. Future studies are proposed to further quantify and understand these location-dependent
Characterization of Drosophila Nucleobindin: An Evolutionarily Conserved Ca2+... - Vadivel Prabahar
This document summarizes the Ph.D. thesis work of Vadivel Prabahar carried out from 2007-2014 at the Indian Institute of Technology Madras. The thesis characterized Drosophila nucleobindin (Dmnucb), a calcium- and zinc-binding protein. Spectroscopy experiments revealed that Dmnucb undergoes distinct conformational changes when binding calcium versus zinc. A variant of Dmnucb, Dmnuc30, was also found to bind to Gαi1 protein in a calcium-dependent manner through a conserved G protein binding motif. The thesis provided evidence that calcium and zinc induce different structural changes in Dmnucb through effects on tryptophan residues and secondary structure.
This study characterized the Dvilp7 gene from Drosophila virilis through a series of experiments. RNA was purified from D. virilis and used to construct cDNA. RACE experiments were used to amplify the 5' and 3' ends of the Dvilp7 cDNA sequence. The full Dvilp7 cDNA sequence was assembled and found to encode a putative protein with a signal peptide. Genomic DNA was also sequenced and compared to determine intron sequences. Characterizing the Dvilp7 gene expands understanding of the genetic mechanisms regulating insulin signaling in Drosophila.
The engrailed gene is a segment polarity gene in Drosophila melanogaster that plays several important roles during development. It defines the posterior region of each embryonic parasegment, establishing anterior-posterior polarity. The engrailed gene also helps pattern the brain by defining borders between regions and guiding neuronal axon growth. Comparisons of engrailed DNA and protein sequences across species show it is conserved and related genes can be found in vertebrates as well.
This document discusses the need for annotation of genomic data given the deluge of information from next generation sequencing. It outlines that clinical-grade annotation is important for application. Many sources of annotation are discussed, including databases, literature, testing labs, and crowdsourcing. However, it emphasizes that specialized human curation remains essential for high quality annotation.
Identification, annotation and visualisation of extreme changes in splicing w... - Mar Gonzàlez-Porta
Talk for the ECCB'14 workshop: Analysis of differential isoform usage by RNA-seq: statistical methodologies and open software - Strasbourg, 7th September 2014
DIYA: An annotation pipeline for any genomics lab - Andrew Stewart
The document describes DIYA, an open source pipeline for annotating genomic sequences. The pipeline takes DNA contigs as input and produces a fully annotated sequence as output. It is modular and expandable. The pipeline includes steps for assembly, gene prediction, BLAST searches, and identification of non-coding RNA and tRNA. The software is designed to be decentralized and accessible for smaller genomics labs.
Web Apollo: A Web-based Genomic Annotation Editing Platform ISB2013 - Monica Munoz-Torres
More efficient sequencing technologies mean a dramatic increase in our access to whole genome sequences, and annotation efforts must adapt to keep pace in converting these sequence data into knowledge. The growing number of genome sequencing projects also means there will be a larger reliance on contributions from domain specialists. This is indicative of a curation environment shifting from a traditional centralized model to a geographically dispersed community annotation model, which requires new tools to support collaborative annotation. WebApollo is a successor to the Apollo annotation editor; it provides a web-based environment that allows multiple distributed users to review, edit, and share manual annotations. The WebApollo client is designed as an extension to JBrowse, a genome browser that provides a fast, highly interactive interface for visualization of genomic data. WebApollo allows users to create and modify transcript and exon structures through intuitive gestures, and flags potential problems within these manual annotations.
Web Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGen - Monica Munoz-Torres
This is the talk I gave at the 7th Arthropod Genomics Symposium, hosted by the Eck Institute for Global Health at University of Notre Dame in South Bend, Indiana, USA.
A talk that I gave to a general audience at UC Davis. Slides were also used for Prof. Ian Korf's presentation at the Genome 10K workshop (May 25th, 2013). This talk mostly concerns the results of the Assemblathon 2 contest, but also covers other issues relating to genome assembly.
Note, this talk has been superseded by updated versions (also available on slideshare)!
ESTs are short sequences of DNA that represent genes expressed in certain tissues or organisms. They provide a quick and inexpensive way for scientists to discover new genes and map their positions in genomes. ESTs represent a snapshot of genes expressed in a tissue at a given time. Sequencing the beginning or end of cDNA clones produces 5' and 3' ESTs, which can help identify genes and study gene expression and regulation.
GRC Workshop at Churchill College on Sep 21, 2014. This is Paul Kitt's talk describing the NCBI approach to annotating the full human reference assembly.
Apollo is a web-based application that supports and enables collaborative genome curation in real time, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Apollo allows researchers to break down large amounts of data into manageable portions to mobilize groups of researchers with shared interests.
The i5K, an initiative to sequence the genomes of 5,000 insect and related arthropod species, is a broad and inclusive effort that seeks to involve scientists from around the world in their genome curation process, and Apollo is serving as the platform to empower this community.
This presentation is an introduction to Apollo for the members of the i5K Pilot Project working on species of the order Hemiptera.
Three groups annotated the genome of Mycoplasma genitalium and found inconsistencies in their annotations. Of the 468 genes, 318 were annotated consistently by all three groups but 45 had conflicting annotations. Errors likely arose from insufficient sequence similarity to determine homology accurately or incorrectly inferring function based on homology alone. Database curation is needed to prevent propagation of erroneous annotations.
Avian Biology Research free Abstracts edition. ABR covers all aspects of avian research, from birdringing to poultry breeding. Ornithology, birds, birdbanding, avian research, nesting and more!
The document provides information about genomics and the Human Genome Project. It defines genomics as the study of the structure and function of entire genomes. It describes the goals of the Human Genome Project as identifying all human genes, determining DNA sequences, and making the data publicly available. Sequencing techniques used include shotgun sequencing and Sanger sequencing. The document also discusses how DNA is amplified and prepared for sequencing.
Optimizing Grape Rootstock Production and Export of inhibitors of X. fastidio...huyng
This document discusses using transgenic rootstocks expressing polygalacturonase-inhibiting proteins (PGIPs) to control Pierce's disease in grapevines caused by Xylella fastidiosa. PGIPs inhibit the pectin-degrading enzyme polygalacturonase (PG) produced by X. fastidiosa that is important for disease development. The objectives are to identify effective PGIPs, optimize their expression in roots, and test if they are transported through the graft junction into scions to inhibit X. fastidiosa systemic movement and reduce Pierce's disease symptoms. Transgenic rootstocks will be evaluated for disease resistance as well as viticultural and wine quality impacts.
This document describes the development of a multiplex PCR assay targeting the cgcA gene, which encodes a diguanylate cyclase, to differentiate between species within the genus Cronobacter. Analysis of 12 Cronobacter genomes identified 7 conserved diguanylate cyclase-encoding genes, one of which, cgcA, showed species-specific divergence that matched known phylogenetic relationships between Cronobacter species. Primers were designed for this gene and tested in a multiplex PCR assay on 305 Cronobacter isolates representing 6 species. The assay correctly identified the species of all isolates tested and did not identify any of 20 non-Cronobacter species, demonstrating high specificity and sensitivity for rapid identification of Cronobacter.
The document discusses various applications of genetic engineering and new technologies in agriculture. It describes how genes control specific traits in plants and how genetic variability can be used to develop new crop varieties, like the columnar apple tree. It also discusses using gene editing to develop disease-resistant peppers and oilseed rape varieties that produce long-chain omega-3 fatty acids. Finally, it shows how robotics and automation are being used for tasks like harvesting asparagus and weeding to improve efficiency.
This work package aims to understand genome evolution in Phytophthora by analyzing UK diversity, origins of emerging strains, and mechanisms of adaptation such as hybridization and horizontal gene transfer. The researchers will sequence genomes to understand how Phytophthora adapts to new hosts, woody hosts, and the role of hybridization and gene transfer in evolution.
Investigation of phylogenic relationships of shrew populations using genetic...Juan Barrera
This study investigated the phylogenetic relationships of shrew populations using genetic markers. DNA was extracted from tissue samples of shrews from different geographic regions. Two mitochondrial genes, cytochrome c oxidase subunit 1 (COI) and cytochrome b (cyt-b), were amplified via PCR and sequenced. Genetic markers were used to identify species and construct a phylogenetic tree. The cyt-b gene was found to be more accurate for species identification and phylogenetic analysis of shrews. Some inconsistencies between the COI and cyt-b trees and BLAST results suggest that cyt-b data is more abundant in genetic databases for shrew species comparison.
The document discusses various types of genetic mutations including point mutations, deletions, insertions, inversions, substitutions, silent mutations, nonsense mutations, and frame shifts. It also summarizes the process of genetic engineering including extracting DNA from one organism, combining it with DNA from another, and introducing it into cells to produce a desired trait like insulin production. Additionally, it mentions cloning, the Flavr-Savr tomato as the first genetically modified food, proposals for featherless chickens, early human tissue engineering experiments, and the development of Golden Rice.
This document summarizes the progress of sequencing the Medicago truncatula genome. Key points include:
- M. truncatula is an important forage crop and model legume with a relatively small genome that is being sequenced to further legume research.
- Initial whole genome shotgun sequencing at low coverage identified highly repetitive regions. A mapped BAC approach is now being used.
- Over 1,000 BACs have been sequenced with a goal of completing the euchromatic regions of four chromosomes. Other chromosomes are being sequenced at other institutions.
- Gene predictions from the sequenced data find a gene density of about one gene per 6.5-7.6 kb and over 13,000 genes identified
This document contains a summary of Sandra G. Nishikawa's skills and experience. She has over 30 years of experience in molecular biology techniques including DNA/RNA isolation, PCR, cloning, sequencing, cell culture and more. She has worked as a research technician for various professors at the University of Calgary studying topics like hepatitis B virus, prion diseases, and cancer.
Making Protein Function and Subcellular Localization Predictions: Challenges ...fionabrinkman
The document discusses challenges and opportunities in predicting protein function and subcellular localization from sequence data alone. It outlines issues with current orthology-based and pathway-based prediction methods, and ways to improve functional predictions by differentiating true orthologs from non-orthologous relationships and developing better pathway signatures. The author advocates for databases like OrtholugeDB that pre-compute ortholog predictions across many genomes to facilitate large-scale evaluation of prediction methods.
Disrupted development and altered hormone signaling in male Padi2:Padi4 doubl...Cornell University
Background: Peptidylarginine deiminase enzymes (PADs) convert arginine residues to citrulline in a process called citrullination or deimination. Recently, two PADs, PAD2 and PAD4, have been linked to hormone signaling in vitro and the goal of this study was to test for links between PAD2/PAD4 and hormone signaling in vivo.
Methods: Preliminary analysis of Padi2 and Padi4 single knockout (SKO) mice did not find any overt reproductive defects and we predicted that this was likely due to genetic compensation. To test this hypothesis, we created a Padi2/Padi4 double knockout (DKO) mouse model and tested these mice for a range of reproductive defects. Results: Controlled breeding trials found that DKO breeding pairs, particularly males, appeared to take longer to have their first litter than wild-type FVB controls (WT), and that pups and DKO male weanlings weighed significantly less than their WT counterparts. Additionally, DKO males took significantly longer than WT males to reach puberty and had lower serum testosterone levels. Furthermore, DKO males had smaller testes than WT males with increased rates of germ cell apoptosis.
Conclusions: The Padi2/Padi4 DKO mouse model provides a new tool for investigating PAD function and outcomes from our studies provide the first in vivo evidence linking PADs with hormone signaling.
Talk on Phylogenomics for MBL Molecular Evolution Course 2004Jonathan Eisen
This document discusses phylogenomics and how analyzing genome sequences through an evolutionary lens can provide insights into how species evolve. It covers several topics: introducing phylogenomics and how evolutionary analysis is key to interpreting genomes; examples of phylogenomic studies of species evolution, uncultured organisms, and functional predictions; and the importance of increasing phylogenetic diversity in genome sequencing to better understand evolution. The document advocates for taking an evolutionary perspective in comparative genomic studies.
Genetic and Molecular Characterization of a Dental Pathogen Using a Genome-Wi...shabeel pn
This document discusses genetic and molecular characterization of Actinobacillus actinomycetemcomitans (A.a.), a dental pathogen, using genomic approaches. Key points include:
1) A.a.'s genome has been sequenced which will help study its iron acquisition systems, Fur and iron regulons, and virulence factors.
2) A rat model has been used to study A.a. pathogenesis and induced colonization, immune response, and bone loss similar to human infections.
3) Future studies aim to use genomics and DNA microarrays to better understand A.a. biology, host-pathogen interactions, and develop new therapies.
Poster_Molecular analysis of BRAF and RAS family genes in thyroid carcinoma i...Alexandra Papadopoulou
This study analyzed mutations in the BRAF and RAS family genes in 33 thyroid carcinoma samples from Greek patients. The samples represented the common types of thyroid cancer seen in the general population. DNA was extracted from tissue biopsies and tested for mutations in BRAF codon 600 and RAS codons 12, 13 and 61 via PCR and sequencing. No mutations were found in BRAF or the RAS genes tested. While the sample size was representative of thyroid cancer prevalence, the results need validation in a larger cohort given the reported mutually exclusivity of mutations and variability in prevalence depending on carcinoma type. A previous Greek study found higher mutation rates than reported literature, highlighting genetic differences between populations.
PREVALENCE OF CANCER ASSOCIATED GENES IN BREAST CANCER PATIENTS IN THE HOSPIT...Jagadish Hansa
This document summarizes a study on the prevalence of cancer-associated genes in breast cancer patients in southern Assam, India. The study surveyed and collected samples from hospitals to analyze the tissues of breast cancer patients through immunohistochemistry and gene amplification. Three mutations - 185delAG, 1014delGT, and 3889delAG - were identified in the BRCA1 gene in southern Assam patients. The mutations have been reported in diverse ethnic populations and are associated with increased risk of breast cancer. Further analysis of patient samples and genes is needed to understand the breast cancer risk factors in the region.
Big data biology for pythonistas: getting in on the genomics revolutionDarya Vanichkina
Slides for the talk I gave at PyCon Australia trying to simplify biology and genomics into something easily accessible for software developers and CompSci graduates.
I cover
1. What biological data looks like today
2. How the revolution in genomics sequencing technology is IN a hospital near you
3. How this is affecting patient treatment today
4. What are some of the major challenges in using this data in the clinic?
and ...
5. (1 slide about ) How my research fits into the paradigm of understanding human genetic variation.
The objectives are an understanding of:
▶ How “function” in a genomic context can be defined. This is a surprisingly philosophical problem.
▶ Some strategies for determining if a genomic region is likely to be functional or not.
Objectives are an understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
Sequence alignment & comparative genomics
▶ What’s the difference between homology and analogy?
▶ How is homology estimated?
▶ Where do sequence similarity scores come from?
▶ What are the BLOSUM protein scoring matrices?
▶ How are insertions and deletions scored?
Genome annotation & comparative genomics
An appreciation for:
▶ An overview of some techniques and methods used for comparative genomics
▶ An understanding of genome annotation methods, particularly the advantages and disadvantages of the different methods:
▶ Sequence analysis (ORF finding)
▶ Comparative sequence analysis
▶ Experimental methods (RNA-seq & mass spectrometry)
Does RNA avoidance dictate protein expression level?Paul Gardner
Selection against mRNA:ncRNA interactions is observed across bacterial and archaeal genomes. Stochastic interactions between abundant ncRNAs and mRNAs may influence protein expression levels by inhibiting translation. Analysis of highly conserved genes in over 1,500 bacterial and 100 archaeal genomes provides evidence that mRNA and ncRNA sequences have evolved to avoid stable interactions, suggesting such interactions are selectively avoided to prevent inaccurate regulation of protein levels.
1) The document discusses machine learning and random forests, a popular machine learning method.
2) Random forests use decision trees built from random subsets of variables and data, and aggregate results to improve accuracy.
3) Examples using random forests to classify iris data are shown, including evaluating variable importance and classification accuracy.
This document discusses clustering and classification techniques for analyzing multivariate data. It begins by explaining that multivariate data involves collecting measurements of multiple features for each item. The document then outlines two main styles of analysis: 1) classification (supervised learning), which involves using known class labels to develop a procedure for classifying new items, and 2) clustering (unsupervised learning), which aims to find groups within the data without known class labels. Common techniques are discussed for both classification and clustering. Examples of applications are also provided.
- Monte Carlo methods use randomness to solve problems by running simulations of random samples and processes. This allows evaluating the significance of observed statistics.
- The modern version was developed in the 1940s for use in nuclear weapons projects. It involves generating random samples from an assumed model and comparing statistics to observed values.
- As an example, points in spatial data can be tested for random distribution by comparing properties of the real data to randomly generated point data, like minimum distances between points.
The document discusses resampling techniques called the jackknife and bootstrap. The jackknife involves deleting each observation from the dataset and recalculating statistics to estimate bias, standard error, and confidence intervals. The bootstrap resamples the dataset with replacement many times to estimate properties of statistics like the mean. Both techniques are used to assess reliability of estimates and account for uncertainty without assumptions about the population distribution. The document provides examples applying these methods to estimate standard deviation, confidence intervals for the median, and properties of regression.
This document contains the notes from a lecture on contingency tables and related statistical methods. It introduces contingency tables and how they can be used to analyze relationships between variables. It discusses Fisher's exact test and the chi-squared test for assessing independence in contingency tables. Examples are provided to demonstrate contingency table analysis and visualization of results. Additional resampling methods like bootstrapping and Monte Carlo simulation are also mentioned.
1. The document discusses regression analysis and linear regression models. It defines key terms used in regression including the famous five values, sums of squares, variance, covariance, correlation, slope, and intercept.
2. Methods for assessing the explanatory power and fit of regression models are presented, including the coefficient of determination (r2), standard errors for slope and intercept, and partitioning total sum of squares.
3. The importance of model checking is emphasized through assessing residuals, influence points, and considering alternative transformations to improve linearity.
This document provides an overview of linear regression analysis. It defines key terms like residuals, error sum of squares (SSE), and the "famous five" values needed to perform regression (sums of x, y, x-squared, y-squared, and x times y). Linear regression finds the line of best fit by minimizing SSE. The slope of the regression line is calculated as the covariance (SSxy) divided by the variance (SSx). The regression line is guaranteed to pass through the point of mean x and mean y.
Analysis of covariation and correlationPaul Gardner
The document discusses correlation and covariance in biology. It defines correlation as a relationship between two or more variables, while covariance is a measure of how two random variables vary together. The document provides examples of calculating correlation coefficients between variables using formulas for covariance, variance, and the correlation coefficient r. It warns that correlation does not necessarily imply causation and provides tips for interpreting correlated data.
This document discusses statistical tests for comparing two samples, specifically Fisher's F test, Student's t-test, and Wilcoxon rank-sum test. It provides an example comparing ozone concentrations between two market gardens (Gardens A and B) using these tests. The F test showed the variances between gardens were not significantly different. The t-test and Wilcoxon test both found the mean ozone concentration was significantly higher in Garden B than Garden A.
The document discusses analyzing single samples and making inferences about population parameters. It addresses three key questions for single samples: what is the mean, is the mean significantly different from expectations, and how certain we are of the mean estimate. Parametric or non-parametric methods must be used depending on whether the data meets assumptions like normality. The normal distribution and central limit theorem allow drawing inferences from sample means. Examples demonstrate calculating probabilities for different parts of the normal distribution using z-scores and the standard normal distribution.
The document discusses measures of centrality in biology. It lists some common measures of centrality used to describe datasets, including the mean. It then shows a graph with many data points distributed around a central mean.
The document provides an overview of key concepts for the BIOL209: Fundamentals course, including:
- Instructor-provided slides had no impact on attendance but adversely affected exam performance.
- Note-taking and self-testing improves learning. Some students may experience math anxiety or stereotype threat.
- The scientific method involves forming hypotheses and testing them through experimentation and analysis of data.
- Understanding statistical and experimental design principles is important for reproducing and interpreting results. Randomization, replication, and controlling for confounding variables strengthen experimental conclusions.
Random RNA interactions control protein expression in prokaryotesPaul Gardner
Presented at the NZSBMB/NZMS Conference in Christchurch 2016
CustomScience Award
A core assumption of gene expression analysis is that mRNA abundances broadly correlate with protein abundance, but these two can be imperfectly correlated. Some of the discrepancy can be accounted for by two important mRNA features: codon usage and mRNA secondary structure. We present a new global factor, called mRNA:ncRNA avoidance, and provide evidence that avoidance increases translational efficiency. We demonstrate a strong selection for the avoidance of stochastic mRNA:ncRNA interactions across prokaryotes, and that these have a greater impact on protein abundance than mRNA structure or codon usage. By generating synonymously variant green fluorescent protein (GFP) mRNAs with different potential for mRNA:ncRNA interactions, we demonstrate that GFP levels correlate well with interaction avoidance. Therefore, taking stochastic mRNA:ncRNA interactions into account enables precise modulation of protein abundance.
Avoidance of stochastic RNA interactions can be harnessed to control protein ...Paul Gardner
Presented at the Computational RNA Biology conference in Hinxton, 17-19th October, 2016.
https://coursesandconferences.wellcomegenomecampus.org/events/item.aspx?e=584
2. Medical genomics
Vicky Cameron & Anna Pilbrow at Otago are identifying genetic variation and genes associated with an increased risk of heart disease.
Mike Stratton at the Sanger Institute is hunting for genetic variation that is associated with an increased risk of cancer.
Rob Knight at UC Boulder is sequencing the microbes that live on us, finding associations between our health and microbial communities. See Rob’s TEDTalk.
Paul Gardner Genome annotation
3. Agricultural genomics
Graeme Attwood at AgResearch is trying to stop cows & sheep from emitting greenhouse gases by studying their gut microbes. He has sequenced two methanogenic archaeal genomes of Methanobrevibacter sp.
Honour McCann at Massey University is trying to determine how Pseudomonas syringae pv. actinidiae (PSA) is killing kiwifruit.
Rebecca Ganley at SCION is investigating how Phytophthora Taxon Agathis (PTA) is causing kauri die-back disease and killing kauri trees.
4. Academic interest genomics
Tom Gilbert at the University of Copenhagen is sequencing bird and giant squid genomes.
Elizabeth Murchison is sequencing Tasmanian devils (and their transmissible cancers). See Liz’s TEDTalk.
Neil Gemmell at Otago University is sequencing the tuatara genome.
6. Discussion
How should these researchers annotate their genomes (after
they have sequenced and assembled them)?
What are the fast and cheap methods?
What are the most accurate methods?
7. The data tsunami
Thanks to new sequencing technologies (recall Ant’s
teeny-tiny little sequencer), biologists no longer spend
years acquiring data.
The bottleneck for research is now the analysis phase.
Biologists with good mathematics skills and mathematicians
with an interest in biology are in high demand.
[Figure: the research cycle (gather data; analyze and classify; hypotheses and predictions; experiment), illustrated with an RNA sequence alignment, a consensus secondary structure and a sequence logo.]
8. We can use sequence analysis...
Genes leave a statistical signal in the genome...
Example: identify promoters, ribosome binding sites,
open reading frames (ORFs), terminators
In eukaryotes, CpG islands, splicing signals and poly-A tails may
also be incorporated
How reliable are these approaches? What are the main
weaknesses & strengths?
Figure from: http://zerocool.is-a-geek.net/?p=630
9. Sequence analysis: strengths and weaknesses
ORF prediction: Prodigal, GLIMMER
Strengths:
very fast
cheap
Weaknesses:
false positives (see AntiFam)
misses short peptides (e.g. toxin-antitoxin systems)
No ncRNAs, pseudogenes, recoding elements, ...
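The trade-off listed above can be seen in a toy ORF scanner. This is a minimal sketch, not how Prodigal or GLIMMER work (both use trained statistical models): it only looks for ATG...stop runs on the forward strand, and the function name, the `min_codons` parameter and the example sequence are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    Coordinates are 0-based and half-open; the stop codon is included.
    ORFs shorter than min_codons (counting start and stop) are discarded.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical toy sequence containing one short ORF (ATG AAA TTT TAG).
toy = "CCATGAAATTTTAGGG"
print(find_orfs(toy, min_codons=3))  # [(2, 14, 2)]
print(find_orfs(toy, min_codons=5))  # []
```

Raising `min_codons` removes spurious short ORFs but also discards genuine short peptides; lowering it does the reverse. A scanner like this never reports ncRNAs or pseudogenes, whatever the threshold.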
11. We can use homology...
Evolution tends to preserve functional genomic regions...
Example 1: Use an existing set of genes from related species
and map these onto your genome (e.g. RATT)
Example 2: Align two or more related genomes, look for
conserved regions, patterns of variation can be indicative of
function (e.g. QRNA, RNAz & RNAcode)
How reliable are these approaches? What are the main
weaknesses & strengths?
12. The QRNA approach...
Rivas et al. (2001) Computational identification of noncoding RNAs in E. coli by comparative genomics. Current Biology.
13. DNA encodes Protein
# STOCKHOLM 1.0
#33 unique RNA sequences, 1 peptide sequence
#=GR PR1 G..A..D..V..T..H..P..P..A..G..D..
#=GR PR3 GlyAlaAspValThrHisProProAlaGlyAsp
platypus GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT
opossum GGAGCAGATGTTACTCACCCTCCTGCTGGAGAT
sloth GGAGCAGACGTCACACACCCTCCCGCGGGGGAT
armadillo GGAGCAGACGTCACGCACCCTCCGGCAGGGGAT
tenrec GGGGCCGACGTCACGCACCCCCCTGCGGGCGAT
elephant GGAGCGGATGTCACACACCCGCCTGCGGGGGAT
shrew GGCGCAGATGTCACGCATCCTCCAGCAGGGGAC
hedgehog GGAGCAGATGTCACACACCCCCCAGCAGGAGAT
megabat GGAGCAGATGTCACACACCCTCCTGCAGGAGAT
microbat GGAGCAGATGTCACCCACCCCCCTGCAGGGGAC
dog GGAGCGGATGTCACACACCCCCCAGCCGGGGAC
cat GGAGCCGATGTCACGCACCCCCCAGCAGGGGAT
horse GGAGCGGATGTCACACACCCTCCGGCAGGGGAT
pika GGAGCAGATGTCACTCACCCTCCAGCTGGGGAT
rabbit GGTGCAGATGTCACACACCCCCCAGCTGGAGAT
squirrel GGAGCAGATGTCACTCACCCTCCAGCGGGAGAT
guinea_pig GGAGCAGATGTCACACACCCACCAGCGGGAGAT
mouse GGAGCAGATGTCACTCATCCGCCTGCTGGGGAC
rat GGAGCAGATGTCACTCATCCACCTGCTGGGGAT
kangaroo_rat GGAGCAGATGTTACACACCCTCCAGCAGGGGAT
tree_shrew GGCGCAGACGTCACGCACCCCCCGGCCGGGGAT
human GGAGCGGATGTCACACACCCCCCAGCAGGGGAT
tarsier GGTGCTGATGTCACACACCCCCCTGCAGGGGAT
marmoset GGAGCAGATGTCACACACCCACCAGCAGGGGAT
zebrafinch GGAGCAGATGTCACTCACCCTCCCGCCGGGGAT
green_anole GGGGCAGACGTCACTCACCCGCCAGCCGGGGAC
xenopus GGAGCAGATGTTACACACCCACCTGCTGGTGAT
pufferfish GGTGCGGATGTTACTCATCCTCCTGCTGGTGAT
fugu GGGGCTGATGTTACTCACCCTCCAGCTGGTGAT
stickleback GGTGCAGACGTCACACATCCTCCAGCGGGTGAT
medaka GGTGCCGATGTCACTCATCCTCCTGCCGGGGAC
zebrafish GGGGCAGATGTTACACACCCGCCGGCTGGTGAT
lamprey GGTGCCGATGTGACACACCCTCCAGCGGGAGAC
//
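The alignment above shows the kind of variation pattern that comparative methods exploit: in a protein-coding region, most substitutions fall on the third codon position, where they are often synonymous. The sketch below counts variable columns by codon position for four of the sequences above. It is a crude proxy rather than the probabilistic models that QRNA or RNAcode actually use; the function name is illustrative, and it assumes the alignment is ungapped and starts in frame.

```python
# Four rows copied from the Stockholm alignment above.
alignment = {
    "platypus":  "GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT",
    "opossum":   "GGAGCAGATGTTACTCACCCTCCTGCTGGAGAT",
    "human":     "GGAGCGGATGTCACACACCCCCCAGCAGGGGAT",
    "zebrafish": "GGGGCAGATGTTACACACCCGCCGGCTGGTGAT",
}


def variable_sites_by_codon_position(seqs):
    """Count alignment columns with more than one nucleotide,
    split by codon position (1st, 2nd, 3rd).

    Assumes ungapped, equal-length sequences whose first column
    is the first position of a codon.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal length")
    counts = [0, 0, 0]
    for col in range(length):
        if len({s[col] for s in seqs}) > 1:
            counts[col % 3] += 1
    return counts


print(variable_sites_by_codon_position(list(alignment.values())))  # [0, 0, 9]
```

For these four species every variable column is a third position, which is what we would expect if selection is preserving the encoded peptide. A structured RNA would instead tend to show compensating changes that maintain base pairs.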
[Figure: the standard genetic code drawn as a codon wheel (1st, 2nd and 3rd positions), giving each amino acid's name, abbreviations, molecular weight, structure, chemical class (basic, acidic, polar, nonpolar/hydrophobic) and modifications (sumo, methyl, phospho, ubiquitin, N-methyl, O-glycosyl, N-glycosyl).]
Image source: http://upload.wikimedia.org/wikipedia/en/d/d6/GeneticCode21-version-2.svg
14. DNA encodes RNA
[Figure: a tRNA shown as (A) primary structure, (B) secondary structure and (C) tertiary structure, with the acceptor stem, D loop, anticodon loop, variable loop and TΨC loop labelled.]
Primary structure: 5’ GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA 3’
15. Homology-based annotation: strengths and weaknesses
Example 1: map known genes onto genomes
Strengths: fast, cheap, ...
Weaknesses:
Inaccurate for divergent species (e.g. Graeme’s
Methanobrevibacter or GEBA genomes)
Requires manual correction of borderline results
Errors are propagated throughout the databases
Example 2: aligning genomes
Strengths:
“cheap” if genomes already exist
fast for small genomes
evolutionary support for all discoveries
Weaknesses:
Requires lots of powerful computers for large genomes
Inaccurate for divergent species (e.g. Neil’s tuatara or
Graeme’s Methanobrevibacter)
Requires manual correction of borderline results
16. Homology annotation: nucleotides are difficult to align
[Figure: conservation of Xfam families in bacterial genomes. Conserved families (%) plotted against phylogenetic distance (0.0 to 0.6) for Pfam (N=6671) and Rfam (N=331), with a frequency histogram of RNA-seq species.]
Lindgreen et al. (2014) Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling. PLOS Computational Biology.
17. We can use RNA detection methods...
Remember the central dogma of molecular biology
Example: sequence RNAs from multiple tissues,
developmental stages and environmental conditions
How reliable is this approach? What are the main weaknesses
& strengths?
Wang, Gerstein & Snyder (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics.
18. RNA-seq: strengths and weaknesses
RNA-seq
Strengths:
Experimental support for transcribed regions
Identifies untranslated regions (UTRs), ncRNAs, antisense
RNAs, ...
Identifies alternatively spliced and edited RNAs
Weaknesses:
Expensive & lots of work
RNA degradation and genomic contamination
Transcription does not prove translation
Will miss genes transcribed only in specific developmental stages,
tissues & environmental conditions (e.g. the lsy-6 microRNA)
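As a rough illustration of how RNA-seq gives "experimental support for transcribed regions", the sketch below thresholds per-base read depth and reports the intervals that pass. It is a simplification under stated assumptions: the depth values are made up, the function name and both thresholds are arbitrary choices for this example, and real analyses rely on read aligners and dedicated transcript-assembly tools.

```python
def transcribed_regions(coverage, min_depth=5, min_length=3):
    """Return half-open (start, end) intervals where read depth stays
    at or above min_depth for at least min_length consecutive bases."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos                      # a covered run begins
        elif depth < min_depth and start is not None:
            if pos - start >= min_length:    # keep runs that are long enough
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions


# Hypothetical per-base read depths along a short stretch of genome.
depths = [0, 0, 6, 9, 12, 8, 1, 0, 7, 7, 0, 5, 5, 5]
print(transcribed_regions(depths))  # [(2, 6), (11, 14)]
```

A gene that is silent in the sampled tissue or condition has zero depth everywhere and is never reported, which is the last weakness listed above.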
19. We can use protein detection methods...
Central dogma of molecular biology
Example: Protein mass spectrometry
How reliable is this approach? What are the main weaknesses
& strengths?
Figure from: http://en.wikipedia.org/wiki/Protein_mass_spectrometry
20. Protein mass spectrometry: strengths and weaknesses
Protein mass spectrometry
Strengths:
Experimental support for translated regions
Identifies alternative isoforms and post-translational
modifications (Ezkurdia et al. 2012)
Weaknesses:
Expensive & lots of work
Misses genes transcribed in specific developmental stages,
tissues & environmental conditions
Current technology generally detects only the most
abundant proteins
Ezkurdia et al. (2012) Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function. Mol Biol Evol.
21. How cool is this?!
Schwanhäusser et al. (2011) Global quantification of mammalian gene expression control. Nature.
22. This is also kinda neat...
Lu et al. (2007) Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nature Biotechnology.
23. Relevant reading
Reviews:
Stein L (2001) Genome annotation: from sequence to biology.
Nature Reviews Genetics.
Reed JL et al. (2006) Towards multidimensional genome
annotation. Nature Reviews Genetics.
ORF finding:
Delcher AL et al. (2007) Identifying bacterial genes and
endosymbiont DNA with Glimmer. Bioinformatics.
Hyatt D et al. (2010) Prodigal: prokaryotic gene recognition
and translation initiation site identification. BMC
Bioinformatics.
RNA-seq (Ant’s lectures)
Wang, Gerstein & Snyder (2009) RNA-Seq: a revolutionary
tool for transcriptomics. Nature Reviews Genetics.
Proteomics (Sarah’s lectures)
Ezkurdia et al. (2012) Comparative proteomics reveals a
significant bias toward alternative protein isoforms with
conserved structure and function. Mol Biol Evol.
24. Homework: How to make a sequence alignment?
Play: http://phylo.cs.mcgill.ca
or even better, play Ribo: http://ribo.cs.mcgill.ca/