This document summarizes an automated phylogenetic tree-based small subunit rRNA taxonomy and alignment pipeline (STAP) developed by the authors. STAP generates high-quality multiple sequence alignments and phylogenetic trees from rRNA gene sequences in a fully automated manner, allowing for phylogenetic analysis of large datasets. It combines existing tools like BLAST, CLUSTALW and PHYML with new programs for automated alignment, masking, and tree parsing. STAP yields results comparable to manual analysis but with increased speed and capacity needed to analyze the large volumes of rRNA data now being generated.
During her summer internship at Knome Inc., Neha Gupta worked on multiple bioinformatics projects including interpreting the effects of genetic variants using Condel scores and performing pedigree analysis and estimating inbreeding using PLINK. She wrote scripts, researched tools like MAPP, and analyzed outputs to evaluate Condel scores. Future work includes integrating additional tools into Condel scoring and automating MAPP for whole genome analysis. She gained valuable experience working independently and as part of a team on challenging projects.
Abstract: The focus in this session will be put on the differences between standard DNA mapping and RNAseq-specific transcript mapping: identifying splice variants and isoforms. The issue of transcript quantification and genomic variants that can be identified from RNAseq data will be discussed.
RNASeq DE methods review Applied Bioinformatics Journal Club - Jennifer Shelton
This document summarizes a journal club discussion on comparing commonly used differential expression software packages using two benchmark datasets. It describes the focus of comparing normalization methods, sensitivity and specificity of differential expression detection, and the impact of sequencing depth and replication. The document then provides details on the normalization and statistical modeling approaches used by different packages, including DESeq, edgeR, Cuffdiff, baySeq, PoissonSeq and limma. It concludes by outlining the results presented on normalization performance, differential expression analysis, and how factors like replication and sequencing depth influence detection of differentially expressed genes.
Sequence and Structural Databases of DNA and Protein, and its significance in... - SBituila
This document discusses various DNA and protein sequence and structural databases, including their history, roles, and available tools. Some of the key databases mentioned are NCBI, EMBL, DDBJ, GenBank, UniProt, and PDB. NCBI maintains large public nucleotide and protein databases and provides analysis tools. EMBL collects and distributes sequence data. PDB is a database for 3D structural data of biomolecules. Together, these databases provide essential resources for genomic and proteomic research.
A metagenome is the entire genetic information of the microorganisms present at a specific site and time. Metagenomic data can be analyzed by two approaches: 1) amplicon (16S rRNA gene) data analysis and 2) whole-genome metagenomic data analysis. Here we focus on 16S rRNA amplicon analysis using the Mothur pipeline.
This document outlines the course content for a bioinformatics course covering 4 units:
Unit 1 introduces basic concepts of bioinformatics including proteins, DNA, RNA, and sequence, structure, and function.
Unit 2 covers major bioinformatics databases including those for nucleotide sequences, protein sequences, sequence motifs, protein structures, and other relevant databases.
Unit 3 discusses topics like single and pairwise sequence alignment, scoring matrices, and multiple sequence alignments.
Unit 4 covers the human genome project, gene and genomic databases, genomic data mining, and microarray techniques.
This document summarizes the development and application of using genetic segregation patterns in a large family to establish a "ground truth" set of variants for evaluating variant calling pipelines. The author analyzed whole genome sequencing data from a 14-person CEPH pedigree, identifying over 680 recombination crossovers to phase variants into maternal and paternal haplotypes. Over 99% of called variants were found to segregate in a manner consistent with genetic inheritance, establishing this set as a high-confidence "ground truth" for variant calling assessment. The analysis also identified areas for improvement in structural variant and small indel calling.
The document discusses Prosite, a database of protein family signatures that can be used to determine the function of uncharacterized proteins. It contains patterns and profiles formulated to identify which known protein family a new sequence belongs to. The Prosite database consists of two files - a data file containing information for scanning sequences, and a documentation file describing each pattern and profile. New Prosite entries are mainly profiles developed by collaborators at the SIB Swiss Institute of Bioinformatics to identify distantly related proteins based on conserved residues.
The document discusses various types of biological databases. It describes primary databases that contain original data, secondary databases that contain processed data derived from primary databases, and composite databases that collect and filter data from multiple primary databases. Examples of specific biological databases are provided, including nucleic acid databases like GenBank, protein sequence databases like Swiss-Prot, protein structure database PDB, and metabolic pathway database KEGG. Details about the purpose and features of some of these major databases like GenBank, DDBJ, EMBL, Swiss-Prot, and PDB are outlined in the document.
Standarization in Proteomics: From raw data to metadata files - Yasset Perez-Riverol
The document discusses various mass spectrometry file formats used in proteomics workflows, including the advantages of XML-based formats like mzML and mzIdentML that support metadata and can be read by different software. It also describes challenges with proprietary binary formats and efforts to develop common data standards and APIs through projects like ProteoWizard, PRIDE, and the ms-core-api library. Standard file formats are important for sharing and reusing proteomics data over time as instrumentation and software evolve.
S.Prasanth Kumar is a bioinformatician who studies proteomics, 2D-PAGE, and proteome databases. Proteomics involves the study of proteins expressed by a genome through analysis of protein sequences, structures, modifications, and interactions. Major databases include Swiss-Prot, which contains annotated protein sequences, and TrEMBL, which contains automatically generated sequences. Other databases contain information on protein families and domains, nucleotide sequences, 2D-PAGE gel images, and post-translational modifications.
Ontologies for life sciences: examples from the gene ontology - Melanie Courtot
Ontologies for life sciences: examples from the Gene Ontology
The document discusses ontologies for life sciences, using the Gene Ontology (GO) as an example. It provides an overview of GO, describing it as a way to capture biological knowledge for gene products in a written and computable form using a set of concepts and relationships arranged hierarchically. GO allows consistent descriptions of genes/gene products across databases. Model organism databases provide annotations connecting genes to GO terms. The GO is a collaborative effort to address the need for consistent descriptions of genes.
Genomic databases are online repositories of genomic variants, described for a single gene (locus-specific), for multiple genes (general), or for a specific population or ethnic group (national/ethnic).
Introduction to 16S Analysis with NGS - BMR Genomics - Andrea Telatin
This document provides an overview and primer on 16S amplicon sequencing and analysis for metagenomics. It discusses how 16S is a ubiquitous gene that can be used to compare microbial communities across samples, outlines common analysis steps like preprocessing, OTU picking, taxonomy assignment, and diversity metrics, and introduces two analysis tools - MEGAN and Qiime. Key advantages and limitations of the 16S amplicon approach are highlighted.
PomBase conventions for improving annotation depth, breadth, consistency and ... - Valerie Wood
PomBase uses a combination of annotation conventions and QC mechanisms. In addition to identifying annotation inconsistencies and errors, these combined methods improve information content, annotation coverage, depth or specificity and redundancy.
The National Center for Biotechnology Information (NCBI) was created in 1988 as part of the National Library of Medicine at NIH. It establishes public databases for biological research, develops software tools for sequence analysis, and disseminates biomedical information from its location in Bethesda, MD. NCBI houses several integrated databases including PubMed, GenBank, RefSeq, and UniGene that contain literature, sequences, gene information, and more.
Variant (SNPs/Indels) calling in DNA sequences, Part 2 - Denis C. Bauer
Abstract: This session will focus on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration will be introduced.
Introduction to 16S rRNA gene multivariate analysis - Josh Neufeld
Short introductory talk on multivariate statistics for 16S rRNA gene analysis given at the 2nd Soil Metagenomics conference in Braunschweig Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.
The document provides information about RNA-seq analysis using R and Bioconductor. It begins with an introduction to the BCBB branch and its services assisting researchers with bioinformatics and computational projects. The document then discusses RNA-seq, R, and Bioconductor individually before explaining how they can be used together for RNA-seq analysis. Step-by-step tutorials and resources are provided for differential expression analysis and other tasks using R packages like DESeq2.
Talk by J. Eisen for NZ Computational Genomics meeting - Jonathan Eisen
This document discusses phylogeny-driven approaches to studying microbial diversity using ribosomal RNA gene sequences. It provides background on how advances in sequencing technology and appreciation of microbial diversity have enabled microbiome research. The document outlines several uses of phylogeny in microbiome studies, including constructing species phylogenies using rRNA sequences and assigning taxonomy to environmental sequences via rRNA phylotyping. It describes challenges with analyzing large rRNA datasets and introduces an automated pipeline called STAP that generates high-quality multiple sequence alignments and phylogenetic trees to classify sequences and analyze species diversity in a manner that scales to large datasets.
The document discusses various types of biological databases that contain information on amino acids, nucleic acids, proteins and their structures. It describes primary databases that contain sequence data as well as secondary and tertiary databases that include protein structure information like motifs, domains and atomic coordinates derived from techniques like X-ray crystallography. Major databases discussed include Swiss-Prot, PDB, EMBL and their roles in archiving sequence data, annotating proteins and classifying structural information using systems like SCOP and CATH.
This document discusses protein sequence databases and their role in storing protein data generated from genome projects and new proteomics technologies. It describes several types of protein databases, including universal repositories like GenPept that store sequences with little annotation, and expertly curated databases like Swiss-Prot that enrich sequence data with additional validation and integration. Specialized databases also exist that focus on specific protein families, organisms, structures like SCOP, or classifications like CATH.
This document discusses major biological databases. It describes three types of biological databases: primary databases that contain original experimental data, secondary databases that contain additional derived information from primary databases, and composite databases that combine data from multiple sources. The document focuses on describing GenBank, a primary sequence database maintained by the National Center for Biotechnology Information. It provides details on how sequences are submitted to GenBank and how entries are formatted, including information contained in various fields like LOCUS, DEFINITION, and FEATURES. The document also briefly introduces the European Molecular Biology Laboratory database, EMBL, which collaborates with GenBank and DDBJ to exchange nucleotide sequence data daily.
BITs: Genome browsers and interpretation of gene lists - BITS
Module 5 Genome browsers and interpreting gene lists.
Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
The National Center for Biotechnology Information (NCBI) was established in 1988 as part of the National Library of Medicine. NCBI houses numerous biomedical databases including those related to genes, proteins, molecular structures, gene expression, and biomedical literature. Users can utilize various tools on the NCBI site to search databases, perform sequence alignments using BLAST, and submit new sequences. Some key databases include GenBank (nucleotide sequences), PubMed (biomedical literature), and RefSeq (non-redundant reference sequences).
The document discusses bioinformatics tools used for analyzing biological data. It begins with an introduction to bioinformatics and then describes several categories of tools: biological databases for storing genomic and protein data; homology tools for sequence alignment and comparison; protein function analysis tools; structural analysis tools; and sequence manipulation and analysis tools. Common tools discussed include BLAST, FASTA, ClustalW, and databases like GenBank. The document concludes by covering applications of bioinformatics in areas like molecular modeling, medicine, and computation.
This paper presents a literature survey of research developments made to date. Its aim is to convey a thorough understanding of existing approaches to gene sequencing and alignment based on the Smith-Waterman algorithm, along with their respective strengths and weaknesses. Quality research benefits from a goal-oriented literature survey, which gives an in-depth understanding of prior work and lets an objective be formulated from the gaps between present requirements and existing approaches. A predominant challenge in gene sequencing is designing an optimized system model that processes data efficiently without introducing memory or time overheads. This research works toward such a system, using the dynamic programming approach known as the Smith-Waterman algorithm in an enhanced form combined with other supporting optimization techniques. The paper introduces the research domain, the research gap and motivations, the objectives formulated, and the systems proposed to accomplish them.
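For readers unfamiliar with the Smith-Waterman algorithm named in this summary, the following is a minimal sketch of its basic dynamic programming recurrence, not the enhanced form the paper proposes. The function name and the scoring values (match 2, mismatch -1, linear gap -2) are illustrative assumptions and do not come from the surveyed paper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions; a linear gap penalty is used.
    """
    # prev holds the previous row of the dynamic programming matrix,
    # so memory use is proportional to len(b) rather than len(a) * len(b).
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Open or extend a gap in either sequence.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Clamping at zero is what makes the alignment local.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

Keeping only two rows gives the score but not the aligned positions; recovering the alignment itself would require storing the full matrix (or traceback pointers) and walking back from the best-scoring cell.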
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto... - IJRTEMJOURNAL
BLAST is the most popular sequence alignment tool for aligning biological sequence patterns. It uses a local alignment process in which, instead of comparing the whole query sequence with each database sequence, it breaks the query sequence into small words and uses these words to align patterns. Its heuristic method makes it faster than the earlier Smith-Waterman algorithm. However, because only small query words are used for alignment, it may perform poorly on very large databases with complex queries. To remove this drawback, we suggest using MSA tools to filter the database by removing unnecessary sequences. The reduced data set is then passed to BLAST, which can identify relationships among the sequences, i.e. homologs, orthologs, and paralogs. The proposed system could further be used to find the relation between two persons or to create a family tree. Orthology is of interest for a wide range of bioinformatics analyses, including functional annotation, phylogenetic inference, and genome evolution. This work describes and motivates an algorithm for predicting orthologous relationships among complete genomes. The algorithm takes a pairwise approach, thus requiring neither tree reconstruction nor reconciliation.
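The word-based seeding idea described in this abstract can be sketched roughly as follows. This is a simplified illustration that uses exact word matches only; it is not BLAST's actual implementation, which also scores neighboring words and extends seed hits into longer alignments. The function names and the default word length of 3 are assumptions made for the example.

```python
from collections import defaultdict


def query_words(query, w=3):
    """Return every overlapping word of length w with its offset in the query."""
    return [(query[i:i + w], i) for i in range(len(query) - w + 1)]


def index_words(subject, w=3):
    """Map each word of length w in the subject to the offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def seed_hits(query, subject, w=3):
    """Return (query_offset, subject_offset) pairs for exact word matches."""
    index = index_words(subject, w)
    hits = []
    for word, q_off in query_words(query, w):
        # Only positions that share a whole word are considered further,
        # which is why this is faster than filling a full alignment matrix.
        for s_off in index.get(word, []):
            hits.append((q_off, s_off))
    return hits
```

For example, `seed_hits("ACGTA", "TTACGTT")` pairs the query words `ACG` and `CGT` with their positions in the subject, giving two candidate seeds on the same diagonal that an extension step could then merge.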
This document provides an overview of bioinformatics and discusses key concepts like:
- Bioinformatics combines biology, computer science, and information technology to analyze large amounts of biological data.
- High-throughput DNA sequencing has generated vast genomic data that requires bioinformatics tools and databases accessible via the internet to analyze and share.
- Popular sequence alignment tools like BLAST, FASTA, and ClustalW are used to search databases and compare sequences, helping researchers analyze genes and genomes.
The document outlines the basic steps in constructing a phylogenetic tree:
1) Assembling and aligning a dataset of DNA or protein sequences of interest.
2) Using computational methods and evolutionary models to build phylogenetic trees from the sequence alignments.
3) Statistically testing and assessing the estimated trees to evaluate which tree topologies best describe the phylogenetic relationships between the sequences.
The process aims to provide a visual representation of how organisms have evolved from a common ancestor over time based on analyses of genetic similarities and differences in their molecular sequences.
Knowing Your NGS Downstream: Functional Predictions - Golden Helix Inc
Next-Generation Sequencing analysis workflows typically lead to a list of candidate variants that may or may not be associated with the phenotype of interest. Any given analysis may result in tens, hundreds, or even thousands of genetic variants which must be screened and prioritized for experimental validation before a causal variant may be identified. To assist with this screening process, the field of bioinformatics has developed numerous algorithms to predict the functional consequences of genetic variants. Algorithms like SIFT and PolyPhen-2 are firmly established in the field and are cited frequently. Other tools, like MutationAssessor and FATHMM are newer and perhaps not known as well.
This presentation will review several of the functional prediction tools that are currently available to help researchers determine the functional consequences of genetic alterations. The biological principles underlying functional predictions will be discussed together with an overview of the methodology used by each of the predictive algorithms. Finally, we will discuss how these predictions can be accessed and used within the Golden Helix SNP & Variation Suite (SVS) software.
This document provides an introduction and overview of the field of bioinformatics. It discusses how bioinformatics combines computer science and biology to analyze large amounts of biological data. Specifically, it mentions that bioinformatics uses algorithms and techniques from computer science to solve complex biological problems related to areas like molecular biology, genomics, drug discovery, and more. It also outlines some of the key applications of bioinformatics like sequence analysis, protein structure prediction, genome annotation, and comparative genomics. Finally, it provides brief descriptions of important biological databases and resources that bioinformaticians use to store and analyze genomic and protein sequence data.
The document discusses tools for analyzing transcriptome data. It describes FastQC, a tool used for quality control checks on raw sequencing data by generating statistics on base quality, GC content, overrepresented sequences, etc. Scripture is described as a tool for de novo assembly of RNA-seq data that relies on aligned reads and a reference genome to reconstruct transcripts. The document outlines the typical workflow of indexing aligned reads, running quality checks with FastQC, and using Scripture or other tools for reconstruction. Common file formats like FASTQ, SAM, BAM and output formats like BED are also summarized.
Bioinformatic tools analyze biological sequences to find similarities, domains, and coding regions. BLAST is a widely used tool that compares a query sequence to database sequences to find regions of similarity, helping scientists determine sequence function. Sequence alignment identifies similar character patterns between two or more sequences and can provide information about function, structure, and evolutionary relationships. CpG islands are regions of DNA where cytosine and guanine nucleotides frequently occur next to each other. Methylation of cytosines within CpG islands can regulate gene expression and is an epigenetic mechanism studied in cancer diagnosis.
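As a rough illustration of the CpG island idea above, the sketch below computes two statistics that are commonly used to flag candidate islands: GC content and the observed/expected CpG ratio, (number of CG dinucleotides × length) / (number of C × number of G). The function name `cpg_stats` is hypothetical, and the thresholds mentioned in the comment are the commonly cited Gardiner-Garden and Frommer criteria, not something taken from the summarized document.

```python
from __future__ import annotations


def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for one DNA window.

    Hypothetical helper for illustration only. A window is often called a
    candidate CpG island when it is at least 200 bp long, has GC content
    above 0.5, and has an observed/expected ratio above 0.6.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact

    gc_content = (c_count + g_count) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / n.
    obs_exp = (cpg_count * n) / (c_count * g_count) if c_count and g_count else 0.0
    return gc_content, obs_exp


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(cpg_stats("ATATATAT"))
```

In practice the statistics would be computed over sliding windows along a chromosome rather than over a single short string.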
De novo transcriptome assembly of SOLiD sequencing data in Cucumis melo - bioejjournal
As sequencing technologies progress, focus shifts towards solving bioinformatic challenges, of which sequence read assembly is the first task. In the present study, we have carried out a comparison of two assemblers (SeqMan and CLC) for transcriptome assembly, using a new dataset from Cucumis melo. Of the two assemblers, SeqMan generated an excess of small, redundant contigs, whereas CLC generated the least redundant assembly. Since different assemblers use different algorithms to build contigs, we merged the assemblies with CAP3 and found that the merged assembly is better than the individual assemblies and more consistent in the number and size of contigs. Combining the assemblies from different programs gave a more credible final product, and therefore this approach is recommended for quantitative output.
This document discusses computational methods and challenges for genome assembly using next-generation sequencing data. It describes the four main stages of genome assembly as preprocessing filtering, graph construction, graph simplification, and postprocessing filtering. Each stage processes the data from the previous stage to build the assembly graph and reduce complexity, though some assemblers delay filtering steps.
This document summarizes computational analysis methods for determining expectation values commonly used in bioinformatics databases. It discusses tools like BLAST and FASTA, and databases like those at NCBI, that allow querying and analyzing sequences. The expectation value (E-value) estimates how many matches with an equal or better score would be expected by chance in a search of that size, with lower values indicating more significant matches. These tools and databases facilitate customizable extraction of data from sequences to enable further analysis and knowledge discovery in bioinformatics.
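To make the expectation value concrete, here is a minimal sketch of the standard Karlin-Altschul relationship used by BLAST-style searches: for a normalized bit score S', the E-value is approximately m × n × 2^(−S'), where m is the query length and n the database length. The function names are hypothetical, and the sketch assumes raw lengths are used directly, whereas real tools apply effective-length corrections.

```python
import math


def evalue_from_bitscore(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value for a hit with the given bit score.

    Assumption: uses raw query and database lengths as the search space;
    real BLAST implementations use edge-corrected effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def p_at_least_one_hit(evalue: float) -> float:
    """Probability of seeing at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-evalue)


if __name__ == "__main__":
    e = evalue_from_bitscore(bit_score=50.0, query_len=300, db_len=1_000_000)
    print(f"E-value: {e:.3g}, P(at least one chance hit): {p_at_least_one_hit(e):.3g}")
```

For small E-values the two quantities are nearly equal, which is why the E-value is often loosely read as a probability even though it is an expected count.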
This document discusses various bioinformatics tools and their functions. It provides details on multiple sequence alignment tools like CLUSTAL Omega, CLUSTALW, BLAST, and FASTA. It explains that CLUSTAL Omega can align a large number of sequences quickly and accurately using progressive alignment. CLUSTALW performs multiple sequence alignment in three steps - pairwise alignment, guide tree creation, and multiple alignment using the guide tree. BLAST can identify unknown sequences by comparing them to known sequences. FASTA uses short exact matches to find similar regions between sequences. Expasy provides access to databases for proteomics, genomics, and other areas. MASCOT searches peptide mass fingerprinting and shotgun proteomics datasets.
Using VarSeq to Improve Variant Analysis Research Workflows - Delaina Hawkins
Many questions must be answered when analyzing DNA sequence variants: How do I determine which variants are potentially deleterious? Is the sequencing quality sufficient? How do I prioritize the results? Which annotation sources may help answer my research question?
In this webinar presentation, we will review workflow strategies for quality control and analysis of DNA sequence variants using the VarSeq software package from Golden Helix. VarSeq is a powerful platform for analysis of DNA sequence variants in clinical and translational research settings. VarSeq provides researchers with easy access to curated public databases of variant annotation information, and also enables users to incorporate their own local databases or downloaded information about variants and genomic regions.
The presentation will include interactive demonstrations using VarSeq to analyze variants found by exome sequencing of an extended family with a complex disease. We will review strategies for assessing variant quality, applying genomic annotations, incorporating custom annotation sources, and creating variant filters in VarSeq. We will also demonstrate the PhoRank gene ranking algorithm and its application for prioritizing variants.
This dissertation developed algorithms and software tools to analyze the biological role of low complexity regions (LCRs) in proteins. It evaluated and improved methods for identifying homologs containing LCRs. It also created LCR-eXXXplorer, a web resource with unique tools for exploring annotated LCRs among millions of proteins. Using these tools, the dissertation predicted pathogenicity of E. coli strains based on genomic composition, showing prediction is possible with limited data like from metagenomic samples. The results open new areas for research on sequence search validation and large-scale experiments.
Similar to Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015 (20)
Innovations in Sequencing & Bioinformatics
Talk for
Healthy Central Valley Together Research Workshop
Jonathan A. Eisen University of California, Davis
January 31, 2024 linktr.ee/jonathaneisen
Talk by Jonathan Eisen for LAMG2022 meeting - Jonathan Eisen
The document discusses the history of the Lake Arrowhead Microbial Genomes (LAMG) conference. It reveals that LAMG2020 was cancelled due to a secret plan by organizers who formed an "anti-karyote society" that hates eukaryotes. The meeting was to be renamed the "Big, Large, Enormous" meeting of the Lake Arrowhead Big Large Enormous Anti-Karyote Society. The document also hints that several past LAMG speakers have made cryptic comments indicating involvement in a conspiracy surrounding the conference.
Thoughts on UC Davis' COVID Current Actions - Jonathan Eisen
Slides I used for a presentation to Chancellor May's leadership council about the current state of UC Davis' response to COVID and how it could be improved
Phylogenetic and Phylogenomic Approaches to the Study of Microbes and Microbi... - Jonathan Eisen
The document discusses Jonathan Eisen's work as a microbiology professor at UC Davis. It provides an overview of his research topics, which include microbial phylogenomics and evolvability, phylogenetic methods and tools, and using phylogenomics to study microbial communities and interactions between microbes and hosts under stress. The document also acknowledges collaborators and funding sources for Eisen's research over the years.
This document summarizes a class on detecting, quantifying, and tracking variations of SARS-CoV-2 RNA from COVID-19 samples. It discusses using quantitative RT-PCR (qRT-PCR) to detect and measure viral RNA levels in samples. Sequencing is used to identify variations in the viral genome over time, and online tools like Nextstrain allow viewing the evolution and global transmission of variants. Genotyping assays are also described that can rapidly screen samples for known single nucleotide variations during PCR.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
EVE198 Winter2020 Class 8 - COVID RNA Detection - Jonathan Eisen
This document summarizes a class on SARS-CoV-2 RNA detection, quantification, and variation. It discusses how qRT-PCR is used to detect and quantify the virus by amplifying and detecting viral RNA. It also covers sequencing to identify variants, how variants evolve over time, and genotyping assays that can screen samples for known single nucleotide variations. Nextstrain and other online tools are presented that use sequencing data to analyze viral phylogenies, track variant distributions globally, and visualize genetic variations across the SARS-CoV-2 genome.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive function. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms for those who already suffer from conditions like depression and anxiety.
EVE198 Winter2020 Class 5 - COVID Vaccines - Jonathan Eisen
The document discusses a class on COVID-19 vaccines. It covers topics like vaccine development, current candidates, delivery challenges, and comparisons between vaccines. Moderna and Pfizer mRNA vaccines are highlighted as being similar but having some differences in mRNA region, nanoparticle structure/synthesis, dosage amount, and storage temperature requirements. Other vaccines discussed include Novavax using spike protein nanoparticles, and AstraZeneca and Johnson & Johnson using DNA for spike protein delivered by a modified virus.
EVE198 Winter2020 Class 9 - COVID Transmission - Jonathan Eisen
This document discusses modes of SARS-CoV-2 transmission including droplets, aerosols, and surfaces. It emphasizes that surfaces are not as big a risk as initially thought. It provides guidance on limiting transmission from different modes such as distancing, masks, washing hands, cleaning surfaces, and improving ventilation. The focus in 2021 is on droplets and aerosols rather than surfaces.
EVE198 Fall2020 "Covid Mass Testing" Class 8 Vaccines - Jonathan Eisen
This document discusses a class on vaccines for COVID-19. It covers topics like vaccine development, current candidate vaccines, challenges with vaccine distribution, and how vaccines are being assessed for safety, effectiveness, costs and production feasibility. Over 100 vaccine candidates are in development using platforms like DNA, RNA, viral vectors and inactivated viruses. Efforts like Operation Warp Speed are coordinating development of nucleic acid, viral vector and protein subunit vaccines. Distribution challenges include vaccine production, storage and logistics, number of doses required, and overcoming vaccine nationalism and hesitancy.
EVE198 Fall2020 "Covid Mass Testing" Class 2: Viruses, COVID and Testing - Jonathan Eisen
EVE198 Fall2020 "Covid Mass Testing" Class 1 Introduction - Jonathan Eisen
Anti-Universe And Emergent Gravity and the Dark Universe - Sérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS - Sérgio Sacani
The pathway(s) to seeding the massive black holes (MBHs) that exist at the heart of galaxies in the present and distant Universe remains an unsolved problem. Here we categorise, describe and quantitatively discuss the formation pathways of both light and heavy seeds. We emphasise that the most recent computational models suggest that, rather than a bimodal-like mass spectrum with light seeds at one end and heavy seeds at the other, a continuum exists, with light seeds being more ubiquitous and heavier seeds becoming less and less abundant due to the rarer environmental conditions required for their formation. We therefore examine the different mechanisms that give rise to different seed mass spectrums. We show how and why the mechanisms that produce the heaviest seeds are also among the rarest events in the Universe and are hence extremely unlikely to be the seeds for the vast majority of the MBH population. We quantify, within the limits of the current large uncertainties in the seeding processes, the expected number densities of the seed mass spectrum. We argue that light seeds must be at least 10^3 to 10^5 times more numerous than heavy seeds to explain the MBH population as a whole. Based on our current understanding of the seed population this makes heavy seeds (M_seed > 10^3 M⊙) a significantly more likely pathway given that heavy seeds have an abundance pattern that is close to and likely in excess of 10^−4 compared to light seeds. Finally, we examine the current state-of-the-art in numerical calculations and recent observations and plot a path forward for near-future advances in both domains.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
TOPIC OF DISCUSSION: CENTRIFUGATION SLIDESHARE.pptx - shubhijain836
Centrifugation is a powerful technique used in laboratories to separate components of a heterogeneous mixture based on their density. This process utilizes centrifugal force to rapidly spin samples, causing denser particles to migrate outward more quickly than lighter ones. As a result, distinct layers form within the sample tube, allowing for easy isolation and purification of target substances.
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S... - Sérgio Sacani
We report the study of a huge optical intraday flare on 2021 November 12 at 2 a.m. UT in the blazar OJ287. In the binary black hole model, it is associated with an impact of the secondary black hole on the accretion disk of the primary. Our multifrequency observing campaign was set up to search for such a signature of the impact based on a prediction made 8 yr earlier. The first I-band results of the flare have already been reported by Kishore et al. (2024). Here we combine these data with our monitoring in the R-band. There is a big change in the R–I spectral index by 1.0 ± 0.1 between the normal background and the flare, suggesting a new component of radiation. The polarization variation during the rise of the flare suggests the same. The limits on the source size place it most reasonably in the jet of the secondary BH. We then ask why we have not seen this phenomenon before. We show that OJ287 was never before observed with sufficient sensitivity on the night when the flare should have happened according to the binary model. We also study the probability that this flare is just an oversized example of intraday variability using the Krakow data set of intense monitoring between 2015 and 2023. We find that the occurrence of a flare of this size and rapidity is unlikely. In machine-readable Tables 1 and 2, we give the full orbit-linked historical light curve of OJ287 as well as the dense monitoring sample of Krakow.
The cost of acquiring information by natural selection - Carl Bergstrom
This is a short talk that I gave at the Banff International Research Station workshop on Modeling and Theory in Population Biology. The idea is to try to understand how the burden of natural selection relates to the amount of information that selection puts into the genome.
It's based on the first part of this research paper:
The cost of information acquisition by natural selection
Ryan Seamus McGee, Olivia Kosterlitz, Artem Kaznatcheev, Benjamin Kerr, Carl T. Bergstrom
bioRxiv 2022.07.02.498577; doi: https://doi.org/10.1101/2022.07.02.498577
SDSS1335+0728: The awakening of a ∼10^6 M⊙ black hole - Sérgio Sacani
Context. The early-type galaxy SDSS J133519.91+072807.4 (hereafter SDSS1335+0728), which had exhibited no prior optical variations during the preceding two decades, began showing significant nuclear variability in the Zwicky Transient Facility (ZTF) alert stream from December 2019 (as ZTF19acnskyy). This variability behaviour, coupled with the host-galaxy properties, suggests that SDSS1335+0728 hosts a ∼10^6 M⊙ black hole (BH) that is currently in the process of ‘turning on’. Aims. We present a multi-wavelength photometric analysis and spectroscopic follow-up performed with the aim of better understanding the origin of the nuclear variations detected in SDSS1335+0728. Methods. We used archival photometry (from WISE, 2MASS, SDSS, GALEX, eROSITA) and spectroscopic data (from SDSS and LAMOST) to study the state of SDSS1335+0728 prior to December 2019, and new observations from Swift, SOAR/Goodman, VLT/X-shooter, and Keck/LRIS taken after its turn-on to characterise its current state. We analysed the variability of SDSS1335+0728 in the X-ray/UV/optical/mid-infrared range, modelled its spectral energy distribution prior to and after December 2019, and studied the evolution of its UV/optical spectra. Results. From our multi-wavelength photometric analysis, we find that: (a) since 2021, the UV flux (from Swift/UVOT observations) is four times brighter than the flux reported by GALEX in 2004; (b) since June 2022, the mid-infrared flux has risen more than two times, and the W1−W2 WISE colour has become redder; and (c) since February 2024, the source has begun showing X-ray emission. From our spectroscopic follow-up, we see that (i) the narrow emission line ratios are now consistent with a more energetic ionising continuum; (ii) broad emission lines are not detected; and (iii) the [OIII] line increased its flux ∼3.6 years after the first ZTF alert, which implies a relatively compact narrow-line-emitting region. Conclusions.
We conclude that the variations observed in SDSS1335+0728 could be either explained by a ∼10^6 M⊙ AGN that is just turning on or by an exotic tidal disruption event (TDE). If the former is true, SDSS1335+0728 is one of the strongest cases of an AGN observed in the process of activating. If the latter were found to be the case, it would correspond to the longest and faintest TDE ever observed (or another class of still unknown nuclear transient). Future observations of SDSS1335+0728 are crucial to further understand its behaviour. Key words. galaxies: active – accretion, accretion discs – galaxies: individual: SDSS J133519.91+072807.4
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi... - Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in a host galaxy JADES-GS+53.13485−27.82088 with a host spectroscopic redshift of 2.903 ± 0.007. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (E(B−V) ∼ 0.9) despite a host galaxy with low-extinction and has a high Ca II velocity (19,000 ± 2,000 km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-z Ca-rich population. Although such an object is too red for any low-z cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (≲ 1σ) with ΛCDM. Therefore unlike low-z Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-z truly diverge from their low-z counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx - goluk9330
Ahota Beel, nestled in Sootea, Biswanath, Assam, is celebrated for its extraordinary diversity of bird species. This wetland sanctuary supports a myriad of avian residents and migrants alike. Visitors can admire the elegant flights of migratory species such as the Northern Pintail and Eurasian Wigeon, alongside resident birds including the Asian Openbill and Pheasant-tailed Jacana. With its tranquil scenery and varied habitats, Ahota Beel offers a perfect haven for birdwatchers to appreciate and study the vibrant birdlife that thrives in this natural refuge.
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonathan Eisen at UCSB Feb 2015
1. Phylogeny-Driven Approaches to
Studies of Microbial and Microbiome
Diversity
Jonathan A. Eisen
University of California, Davis
@phylogenomics
February 7, 2015
UCSB EEMB Graduate Student Symposium
2. Some Lessons I Think I Have Learned
3. Lesson 1: Go With Your Obsessions
19. Tree from Woese. 1987.
Microbiological Reviews 51:221
Map for Graduate School
Lesson 3:
Go Fishing Where
Nobody Else Has
20. Example II: Rice Microbiomes and Phylogeny
Joseph Edwards (@Bulk_Soil), Sundar (@sundarlab), Cameron Johnson, Srijak Bhatnagar (@srijakbhatnagar)
Edwards et al. 2015. Structure, variation, and assembly of the root-associated microbiomes of rice. PNAS
Supplementary Figures
Fig. S1. Map depicting soil collection locations for greenhouse experiment.
Fig. S2. Sampling and collection of the rhizocompartments. Roots are collected from rice plants and soil is shaken off the roots to leave ~1 mm of soil around the roots. The ~1 mm of soil
21. PCR and phylogenetic analysis of rRNA genes
[Figure: workflow from DNA extraction, to PCR (makes lots of copies of the rRNA genes in sample), to sequencing of rRNA genes, to sequence alignment (= data matrix), to phylogenetic tree. The tree places rRNA1 to rRNA4 alongside E. coli, Humans, and Yeast.]
rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’
rRNA2 5’..TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’
rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’
rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’
Data matrix (excerpt):
rRNA3 C A C T G T
rRNA4 C A C A G T
Yeast T A C A G T
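The slide treats a sequence alignment as a data matrix that feeds a phylogenetic tree. As a small sketch of the first step in many distance-based tree methods, the code below counts the aligned columns that differ between each pair of rows, using the rRNA3, rRNA4, and Yeast rows shown on the slide. The function names are hypothetical and the raw mismatch count is a simplification; real analyses typically apply an evolutionary model to correct the distances.

```python
from __future__ import annotations

from itertools import combinations

# Rows copied from the slide's data matrix (one aligned column per character).
ALIGNMENT = {
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "Yeast": "TACAGT",
}


def hamming(a: str, b: str) -> int:
    """Count aligned columns where two equal-length rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must have the same aligned length")
    return sum(x != y for x, y in zip(a, b))


def distance_table(rows: dict[str, str]) -> dict[tuple[str, str], int]:
    """Pairwise mismatch counts for every pair of rows in the matrix."""
    return {(p, q): hamming(rows[p], rows[q]) for p, q in combinations(rows, 2)}


if __name__ == "__main__":
    for pair, dist in distance_table(ALIGNMENT).items():
        print(pair, dist)
```

A clustering method such as neighbor joining would then take a table like this as input to propose a tree.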
22. STAP
An Automated Phylogenetic Tree-Based Small Subunit
rRNA Taxonomy and Alignment Pipeline (STAP)
Dongying Wu1*, Amber Hartman1,6, Naomi Ward4,5, Jonathan A. Eisen1,2,3
1 UC Davis Genome Center, University of California Davis, Davis, California, United States of America, 2 Section of Evolution and Ecology, College of Biological Sciences,
University of California Davis, Davis, California, United States of America, 3 Department of Medical Microbiology and Immunology, School of Medicine, University of
California Davis, Davis, California, United States of America, 4 Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, United States of America,
5 Center of Marine Biotechnology, Baltimore, Maryland, United States of America, 6 The Johns Hopkins University, Department of Biology, Baltimore, Maryland, United
States of America
Abstract
Comparative analysis of small-subunit ribosomal RNA (ss-rRNA) gene sequences forms the basis for much of what we know
about the phylogenetic diversity of both cultured and uncultured microorganisms. As sequencing costs continue to decline
and throughput increases, sequences of ss-rRNA genes are being obtained at an ever-increasing rate. This increasing flow of
data has opened many new windows into microbial diversity and evolution, and at the same time has created significant
methodological challenges. Those processes which commonly require time-consuming human intervention, such as the
preparation of multiple sequence alignments, simply cannot keep up with the flood of incoming data. Fully automated
methods of analysis are needed. Notably, existing automated methods avoid one or more steps that, though
computationally costly or difficult, we consider to be important. In particular, we regard both the building of multiple
sequence alignments and the performance of high quality phylogenetic analysis to be necessary. We describe here our fully-
automated ss-rRNA taxonomy and alignment pipeline (STAP). It generates both high-quality multiple sequence alignments
and phylogenetic trees, and thus can be used for multiple purposes including phylogenetically-based taxonomic
assignments and analysis of species diversity in environmental samples. The pipeline combines publicly-available packages
(PHYML, BLASTN and CLUSTALW) with our automatic alignment, masking, and tree-parsing programs. Most importantly,
this automated process yields results comparable to those achievable by manual analysis, yet offers speed and capacity that
are unattainable by manual efforts.
Citation: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS
ONE 3(7): e2566. doi:10.1371/journal.pone.0002566
multiple alignment and phylogeny was deemed unfeasible.
However, we believe this can compromise the value of the results.
For example, the delineation of OTUs has also been automated
via tools that do not make use of alignments or phylogenetic trees
(e.g., Greengenes). This is usually done by carrying out pairwise
comparisons of sequences and then clustering sequences that
have better than some cutoff threshold of similarity with each
other. This approach can be powerful (and reasonably efficient)
but it too has limitations. In particular, since multiple sequence
alignments are not used, one cannot carry out standard
phylogenetic analyses. In addition, without multiple sequence
alignments one might end up comparing and contrasting different
regions of a sequence depending on what it is paired with.
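The pairwise-comparison approach described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular tool: the function names `percent_identity` and `greedy_cluster` are hypothetical, the identity function assumes the two sequences have already been brought to equal length by some pairwise aligner, and the greedy seed-based grouping is only one of several possible clustering strategies.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions, ignoring positions where either has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pairwise-aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def greedy_cluster(seqs: dict, cutoff: float = 0.97) -> list:
    """Put each sequence in the first cluster whose seed it matches at >= cutoff."""
    clusters = []
    for name, seq in seqs.items():
        for cluster in clusters:
            seed = seqs[cluster[0]]  # the first member acts as the cluster seed
            if percent_identity(seq, seed) >= cutoff:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no seed was close enough: start a new OTU
    return clusters


example = {"a": "ACGTACGTAC", "b": "ACGTACGTAA", "c": "TTTTTTTTTT"}
print(greedy_cluster(example, cutoff=0.9))
```

Because each sequence is only compared with cluster seeds, the result depends on input order, which is one way such threshold-based groupings can differ from a tree-based analysis.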
The limitations of avoiding multiple sequence alignments and
phylogenetic analysis are readily apparent in tools to classify
sequences. For example, the Ribosomal Database Project’s
Classifier program [29] focuses on composition characteristics of
each sequence (e.g., oligonucleotide frequency) and assigns
taxonomy based upon clustering genes by their composition.
Though this is fast and completely automatable, it can be misled in
cases where distantly related sequences have converged on similar
composition, something known to be a major problem in ss-rRNA
sequences [30]. Other taxonomy assignment systems focus
primarily on the similarity of sequences. The simplest of these is
classification tools it does have some limitations. For example,
the generation of new alignments for each sequence is
computationally costly and does not take advantage of available
curated alignments that make use of ss-rRNA secondary structure
to guide the primary sequence alignment. Perhaps most
important, however, is that the tool is not fully automated. In
addition, it does not generate multiple sequence alignments for all
sequences in a dataset which would be necessary for doing many
analyses.
Automated methods for analyzing rRNA sequences are also
available at the web sites for multiple rRNA-centric databases,
such as Greengenes and the Ribosomal Database Project (RDPII).
Though these and other web sites offer diverse powerful tools, they
do have some limitations. For example, not all provide multiple
sequence alignments as output and few use phylogenetic
approaches for taxonomy assignments or other analyses. More
importantly, all provide only web-based interfaces, and their
integrated software (e.g., alignment and taxonomy assignment)
cannot be locally installed by the user. Therefore, the user cannot
take advantage of the speed and computing power of parallel
processing such as is available on linux clusters, or locally alter and
potentially tailor these programs to their individual computing
needs (Table 1).
Given the limited automated tools that are available for
Table 1. Comparison of STAP’s computational abilities relative to existing commonly-used ss-rRNA analysis tools.
Feature | STAP | ARB | Greengenes | RDP
Installed where? | Locally | Locally | Web only | Web only
User interface | Command line | GUI | Web portal | Web portal
Parallel processing | YES | NO | NO | NO
Manual curation for taxonomy assignment | NO | YES | NO | NO
Manual curation for alignment | NO | YES | NO* | NO
Open source | YES** | NO | NO | NO
Processing speed | Fast | Slow | Medium | Medium
It is important to note that STAP is the only software that runs on the command line and can take advantage of parallel processing on linux clusters and, further, is more amenable to downstream code manipulation.
* Note: Greengenes alignment output is compatible with upload into ARB and downstream manual alignment.
** The STAP program itself is open source; the programs it depends on are freely available but not open source.
doi:10.1371/journal.pone.0002566.t001
ss-rRNA Taxonomy Pipeline
Figure 1. A flow chart of the STAP pipeline.
doi:10.1371/journal.pone.0002566.g001
STAP database, and the query sequence is aligned to them using
the CLUSTALW profile alignment algorithm [40] as described
above for domain assignment. By adapting the profile alignment
algorithm, the alignments from the STAP database remain intact,
while gaps are inserted and nucleotides are trimmed for the query
sequence according to the profile defined by the previous
alignments from the databases. Thus the accuracy and quality of
the alignment generated at this step depends heavily on the quality
of the Bacterial/Archaeal ss-rRNA alignments from the
Greengenes project or the Eukaryotic ss-rRNA alignments from
the RDPII project.
Phylogenetic analysis using multiple sequence alignments rests on
the assumption that the residues (nucleotides or amino acids) at the
same position in every sequence in the alignment are homologous.
Thus, columns in the alignment for which ‘‘positional homology’’
cannot be robustly determined must be excluded from subsequent
analyses. This process of evaluating homology and eliminating
questionable columns, known as masking, typically requires time-consuming,
skillful, human intervention. We designed an automated
masking method for ss-rRNA alignments, thus eliminating this
bottleneck in high-throughput processing.
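To make the idea of masking concrete, the sketch below scores each alignment column and drops the low-scoring ones. It is an illustration under stated assumptions only: the score used here (the share of sequences carrying the most common nucleotide, with gaps counting against the column) is a simple stand-in and is not the CLUSTALX-style score STAP uses, and the names `column_scores`, `mask_alignment`, and the `min_score` threshold are hypothetical.

```python
from collections import Counter


def column_scores(alignment: list) -> list:
    """Score each column by the share of rows holding its most common base."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        counts = Counter(base for base in column if base != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))  # gaps lower the score
    return scores


def mask_alignment(alignment: list, min_score: float = 0.5) -> list:
    """Keep only the columns whose score reaches min_score."""
    keep = [i for i, s in enumerate(column_scores(alignment)) if s >= min_score]
    return ["".join(row[i] for i in keep) for row in alignment]


rows = ["ACG-T", "ACGAT", "AC-CT", "ATG-T"]
print(column_scores(rows))
print(mask_alignment(rows))
```

The point of the sketch is the shape of the procedure (score every column, then filter by a cutoff), which is what allows the step to run without human intervention.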
First, an alignment score is calculated for each aligned column
by a method similar to that used in the CLUSTALX package [42].
Specifically, an R-dimensional sequence space representing all the
possible nucleotide character states is defined. Then for each
aligned column, the nucleotide populating that column in each of
the aligned sequences is assigned a score in each of the R
dimensions (Sr) according to the IUB matrix [42]. The consensus
‘‘nucleotide’’ for each column (X) also has R dimensions, with the
Figure 2. Domain assignment. In Step 1, STAP assigns a domain to
each query sequence based on its position in a maximum likelihood
tree of representative ss-rRNA sequences. Because the tree illustrated
here is not rooted, domain assignment would not be accurate and
Dongying Wu, Amber Hartman, Naomi Ward
23. WATERS
chimeric sequences generated during PCR, identifying
closely related sets of sequences (also known as operational
taxonomic units or OTUs), removing redundant
sequences above a certain percent identity cutoff, assigning
putative taxonomic identifiers to each sequence or
representative of a group, inferring a phylogenetic tree of
the sequences, and comparing the phylogenetic structure
Figure 1 Overview of WATERS. Schema of WATERS where white
boxes indicate "behind the scenes" analyses that are performed in WATERS.
Quality control files are generated for white boxes, but not otherwise
routinely analyzed. Black arrows indicate that metadata (e.g.,
sample type) has been overlaid on the data for downstream interpretation.
Colored boxes indicate different types of results files that are
generated for the user for further use and biological interpretation.
Colors indicate different types of WATERS actors from Fig. 2 which
were used: green, Diversity metrics, WriteGraphCoordinates, Diversity
graphs; blue, Taxonomy, BuildTree, Rename Trees, Save Trees; CreateUnifrac;
yellow, CreateOtuTable, CreateCytoscape, CreateOTUFile;
white, remaining unnamed actors.
[Figure 1 boxes: Align; Check chimeras; Cluster; Build Tree; Assign Taxonomy; Tree w/ Taxonomy; Diversity statistics & graphs; Unifrac files; Cytoscape network; OTU table]
Hartman et al 2010. W.A.T.E.R.S.: a Workflow for the Alignment,
Taxonomy, and Ecology of Ribosomal Sequences. BMC Bioinformatics
2010, 11:317 doi:10.1186/1471-2105-11-317
Hartman et al. BMC Bioinformatics 2010, 11:317
http://www.biomedcentral.com/1471-2105/11/317
default is 97% and 99%), and they are also generated for
every metadata variable comparison that the user
includes.
Data pruning
To assist in troubleshooting and quality control,
WATERS returns to the user three fasta files of sequences
that were removed at various steps in the workflow. A
short_sequences.fas file is created that contains all
Figure 3 Biologically similar results automatically produced by WATERS on published colonic microbiota samples. (A) Rarefaction curves similar
to curves shown in Eckburg et al. Fig. 2; 70-72, indicate patient numbers, i.e., 3 different individuals. (B) Weighted Unifrac analysis based on phylogenetic
tree and OTU data produced by WATERS very similar to Eckburg et al. Fig. 3B. (C) Neighbor-joining phylogenetic tree (Quicktree) representing
the sequences analyzed by WATERS, which is clearly similar to Fig. S1 in Eckburg et al.
[Figure 3 panel text: PCA plots with axes labeled "Percent variation explained" (axis numbers lost in extraction) and points labeled A, B, C, D, E, F, and S; tree clade labels BACTEROIDETES, BACTEROIDALES, DELTAPROTEOBACTERIA, ACTINOBACTERIA, VERRUCOMICROBIA, EPSILONPROTEOBACTERIA, FIRMICUTES, CLOSTRIDIA, CLOSTRIDIALES, GAMMAPROTEOBACTERIA, CYANOBACTERIA, ALPHAPROTEOBACTERIA, FUSOBACTERIA, BACILLI, MOLLICUTES]
Amber Hartman, Bertram Ludaescher
24. alignment used to build the profile, resulting in a multiple
sequence alignment of full-length reference sequences and
PD versus PID clustering, 2) to explore overlap between PhylOTU
clusters and recognized taxonomic designations, and 3) to quantify
Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized
workflow of PhylOTU. See Results section for details.
doi:10.1371/journal.pcbi.1001061.g001
Finding Metagenomic OTUs
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA,
Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial
Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput
Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061
PhylOTU
Tom Sharpton
@tjsharpton
25. QIIME Phylotyping and Phylogenetic Ecology
Fig. S6. A set of 96 OTUs mainly consisting of Proteobacteria is
compartment in the greenhouse experiment. (A) Number of OTU
they belong to that are enriched across all rhizocompartments in the
A subset of the Proteobacteria and the classes and families they belo
enriched across all rhizocompartments in the greenhouse.
https://evomics.org/2014/01/the-glories-of-the-gut-ask-a-fat-mouse/
26. QIIME Phylotyping and Phylogenetic Ecology
Lesson 4: Accept When You Are Defeated
27. Rice Microbiome: Variation w/in Plant
Joseph Edwards @Bulk_Soil
Sundar @sundarlab
Cameron Johnson
Srijak Bhatnagar @srijakbhatnagar
growth. For our study, the rhizosphere compartment was com-
Fig. 1. Root-associated microbial communities are separable by rhizocompartment
and soil type. (A) A representation of a rice root cross-section
depicting the locations of the microbial communities sampled. (B) Within-sample
diversity (α-diversity) measurements between rhizospheric compartments
indicate a decreasing gradient in microbial diversity from the rhizosphere
to the endosphere independent of soil type. Estimated species
Edwards et al. 2015. Structure, variation,
and assembly of the root-associated
microbiomes of rice. PNAS
28. Rice Genotype Affects Microbiome
rhizocompartments were analyzed as before. Unfortunately,
collection of bulk soil controls for the field experiment was not
Fig. 3. Host plant genotype significantly affects microbial communities in
the rhizospheric compartments. (A) Ordination of CAP analysis using the
WUF metric constrained to rice genotype. (B) Within-sample diversity
measurements of rhizosphere samples of each cultivar grown in each soil.
Estimated species richness was calculated as e^(Shannon_entropy). The horizontal
Edwards et al. 2015. Structure, variation,
and assembly of the root-associated
microbiomes of rice. PNAS
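The richness estimate named in the Fig. 3 caption above, e raised to the Shannon entropy, can be computed directly from OTU counts. The sketch below is a minimal illustration that assumes the natural logarithm is used for the entropy and that the input is a list of per-OTU read counts for one sample; the function names are hypothetical.

```python
import math


def shannon_entropy(counts: list) -> float:
    """Shannon entropy (natural log) of a list of per-OTU counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def estimated_richness(counts: list) -> float:
    """Effective number of species: e raised to the Shannon entropy."""
    return math.exp(shannon_entropy(counts))


print(estimated_richness([10, 10, 10, 10]))  # four equally abundant OTUs
print(estimated_richness([97, 1, 1, 1]))     # one dominant OTU
```

With perfectly even counts the value equals the number of OTUs, and it shrinks toward 1 as a single OTU comes to dominate, which is why it is a convenient within-sample diversity summary.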
29. Rice: Cultivation Site Effects
Edwards et al. 2015. Structure, variation, and assembly of the root-associated
microbiomes of rice. PNAS
the field plants again showed that the rhizosphere had the
highest microbial diversity, whereas the endosphere had the least
found to be enriche
greenhouse plants (S
OTUs were classifiabl
sisted of taxa in the fa
and Myxococcaceae, al
bidopsis root endosphe
Cultivation Practice Result
The rice fields that we
practices, organic farmi
tion called ecofarming
farming in that chemica
are all permitted but g
harvest fumigants are n
itself does significantly
partments overall (P =
a significant interaction
the rhizocompartments
indicating that the α-d
affected differentially by
the rhizosphere compa
practice, with the mean
zospheres than organic
Dataset S14), whereas
crobial communities (P
tests; Dataset S14). Un
practices are separable a
the WUF metric (Fig.
30. Rice: Functional Enrichment x Genotype
and mitochondrial) reads to analyze microbial abundance in
the endosphere over time (Fig. 6A). Using this technique, we
confirmed the sterility of seedling roots before transplantation.
(13 d) approach the endosphere and rhizoplane microbiome
compositions for plants that have been grown in the greenhouse
for 42 d.
Fig. 5. OTU coabundance network reveals modules of OTUs associated with methane cycling. (A) Subset of the entire network corresponding to 11
modules with methane cycling potential. Each node represents one OTU and an edge is drawn between OTUs if they share a Pearson correlation of
greater than or equal to 0.6. (B) Depiction of module 119 showing the relationship between methanogens, syntrophs, methanotrophs, and other
methane cycling taxonomies. Each node represents one OTU and is labeled by the presumed function of that OTU’s taxonomy in methane cycling. An
edge is drawn between two OTUs if they have a Pearson correlation of greater than or equal to 0.6. (C) Mean abundance profile for OTUs in module 119
across all rhizocompartments and field sites. The position along the x axis corresponds to a different field site. Error bars represent SE. The x and y axes
represent no particular scale.
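The edge rule in the Fig. 5 caption above (draw an edge between two OTUs when their Pearson correlation is at least 0.6) can be sketched as follows. This is an illustrative outline only: the input layout (a mapping from OTU name to a list of abundances across the same ordered samples) and the function names are assumptions, and the module detection step used in the paper is not shown.

```python
import math
from itertools import combinations


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def coabundance_edges(abundance: dict, min_r: float = 0.6) -> list:
    """Return (otu_a, otu_b, r) for every pair with correlation >= min_r."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if r >= min_r:
            edges.append((a, b, r))
    return edges


table = {"otu1": [1, 2, 3, 4], "otu2": [2, 4, 6, 8], "otu3": [4, 3, 2, 1]}
print(coabundance_edges(table))
```

Only positively correlated pairs pass the threshold here, so `otu3`, which runs opposite to the other two, ends up unconnected.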
Edwards et al. 2015. Structure, variation, and assembly of the root-associated
microbiomes of rice. PNAS
31. Rice Developmental Time Series
of magnitude greater than in any single plant species
Under controlled greenhouse conditions, the rhizocomp
described the largest source of variation in the microb
munities sampled (Dataset S5A). The pattern of separ
tween the microbial communities in each compar
consistent with a spatial gradient from the bulk soil a
rhizosphere and rhizoplane into the endosphere (F
Similarly, microbial diversity patterns within samples
same pattern where there is a gradient in α-diversity
rhizosphere to the endosphere (Fig. 1B). Enrichment
pletion of certain microbes across the rhizocompartme
cates that microbial colonization of rice roots is not a
process and that plants have the ability to select for ce
crobial consortia or that some microbes are better at f
root colonizing niche. Similar to studies in Arabidopsis, w
that the relative abundance of Proteobacteria is increas
endosphere compared with soil, and that the relative abu
of Acidobacteria and Gemmatimonadetes decrease from
to the endosphere (9–11), suggesting that the distrib
different bacterial phyla inside the roots might be simil
land plants (Fig. 1D and Dataset S6). Under controlle
house conditions, soil type described the second large
of variation within the microbial communities of each
However, the soil source did not affect the pattern of se
between the rhizospheric compartments, suggesting
rhizocompartments exert a recruitment effect on micro
sortia independent of the microbiome source.
By using differential OTU abundance analysis in t
partments, we observed that the rhizosphere serves an
ment role for a subset of microbial OTUs relative to
(Fig. 2). Further, the majority of the OTUs enriche
rhizosphere are simultaneously enriched in the rhizoplan
endosphere of rice roots (Fig. 2B and SI Appendix, Fig
consistent with a recruitment model in which factors pro
the root attract taxa that can colonize the endosphere. W
that the rhizoplane, although enriched for OTUs that
enriched in the endosphere, is also uniquely enriched for
of OTUs, suggesting that the rhizoplane serves as a sp
Edwards et al. 2015. Structure, variation, and assembly of the root-associated
microbiomes of rice. PNAS
32. Tree from Woese. 1987.
Microbiological Reviews 51:221
Example III: rRNA Not Perfect
Lesson 5: Nothing is Perfect
33. Tree from Woese. 1987.
Microbiological Reviews 51:221
Taxa Phylogeny III: rRNA Not Perfect
34. rRNA Copy # Correction by Phylogeny
Kembel SW, Wu M, Eisen JA, Green JL (2012) Incorporating 16S Gene Copy Number Information Improves Estimates
of Microbial Diversity and Abundance. PLoS Comput Biol 8(10): e1002743. doi:10.1371/journal.pcbi.1002743
Jessica Green
@jessicaleegreen
Steven Kembel
@stevenkembel
Martin Wu
37. RecA vs. rRNA
Eisen 1995 Journal of Molecular Evolution 41: 1105-1123.
Lesson 6: Keep Going Back to Your Past
38. Phylotyping w/ Protein Markers
AMPHORA
http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7
[Figure: bar chart of relative abundance (y axis, 0 to 0.8) for each taxonomic group, with a legend of marker genes.
Taxonomic groups: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified proteobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Acidobacteria, Thermotogae, Fusobacteria, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified bacteria.
Marker genes: dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf]
Martin Wu
39. GOS 1
GOS 2
GOS 3
GOS 4
GOS 5
Phylogenetic ID of Novel Lineages
Wu et al PLoS One 2011
Dongying Wu
40. Phylogenetic Diversity of Metagenomes
typically used as a qualitative measure because duplicate sequences
are usually removed from the tree. However, the
test may be used in a semiquantitative manner if all clones,
even those with identical or near-identical sequences, are included
in the tree (13).
Here we describe a quantitative version of UniFrac that we
call “weighted UniFrac.” We show that weighted UniFrac behaves
similarly to the FST test in situations where both a
FIG. 1. Calculation of the unweighted and the weighted UniFrac
measures. Squares and circles represent sequences from two different
environments. (a) In unweighted UniFrac, the distance between the
circle and square communities is calculated as the fraction of the
branch length that has descendants from either the square or the circle
environment (black) but not both (gray). (b) In weighted UniFrac,
branch lengths are weighted by the relative abundance of sequences in
the square and circle communities; square sequences are weighted
twice as much as circle sequences because there are twice as many total
circle sequences in the data set. The width of branches is proportional
to the degree to which each branch is weighted in the calculations, and
gray branches have no weight. Branches 1 and 2 have heavy weights
since the descendants are biased toward the square and circles, respectively.
Branch 3 contributes no value since it has an equal contribution
from circle and square sequences after normalization.
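The unweighted measure described in part (a) of the caption above can be sketched on a toy tree. The example below is a simplified illustration, not a reference implementation: it assumes exactly two environments, represents the tree as a hypothetical list of `(branch_length, set_of_descendant_leaves)` pairs rather than parsing a tree file, and does not cover the weighted variant.

```python
def unweighted_unifrac(branches: list, env_of: dict) -> float:
    """Fraction of branch length whose descendants come from only one environment."""
    unique_length = 0.0
    total_length = 0.0
    for length, leaves in branches:
        envs = {env_of[leaf] for leaf in leaves if leaf in env_of}
        if not envs:
            continue  # branch leads to no sampled sequence
        total_length += length
        if len(envs) == 1:
            unique_length += length  # descendants from one environment only
    return unique_length / total_length if total_length else 0.0


# Toy tree ((s1:1, c1:1):1, (s2:1, s3:1):1), one entry per branch.
toy_branches = [
    (1.0, {"s1"}),
    (1.0, {"c1"}),
    (1.0, {"s1", "c1"}),   # shared by both environments
    (1.0, {"s2"}),
    (1.0, {"s3"}),
    (1.0, {"s2", "s3"}),
]
toy_env = {"s1": "square", "s2": "square", "s3": "square", "c1": "circle"}
print(unweighted_unifrac(toy_branches, toy_env))
```

In this toy tree only the branch above `s1` and `c1` is shared, so five of the six units of branch length count as unique.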
Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of
Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214
Jessica Green
Steven Kembel
Katie Pollard
41. Phylosift/ pplacer Workflow
[Workflow figure, box labels as recovered:
Input Sequences; each input sequence scanned against both workflows (parallel option)
LAST: fast candidate search (search input against references)
rRNA workflow: candidates split by length (<600 bp, >600 bp) and aligned with hmmalign or Infernal (multiple alignment)
protein workflow: hmmalign (multiple alignment); profile HMMs used to align candidates to reference alignment
pplacer: phylogenetic placement
Taxonomic Summaries: Krona plots, Number of reads placed for each marker gene
Sample Analysis & Comparison: Edge PCA, Tree visualization, Bayes factor tests]
Aaron Darling
@koadman
Erik Matsen
@ematsen
Holly Bik
@hollybik
Guillaume Jospin
@guillaumejospin
Darling AE, Jospin G, Lowe E, Matsen FA IV, Bik HM, Eisen JA. (2014) PhyloSift: phylogenetic
analysis of genomes and metagenomes. PeerJ 2:e243 http://dx.doi.org/10.7717/peerj.243
Erik Lowe
42. Whole Genome Tree of 2000 Taxa
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes:
Supertrees and Supermatrices. PLoS ONE 8(4): e62510. doi:10.1371/journal.pone.0062510
Jenna Lang
@jennnomics
Aaron Darling
@koadman
44. PhyEco Markers
Phylogenetic group | Genome Number | Gene Number | Marker Candidates
Archaea | 62 | 145415 | 106
Actinobacteria | 63 | 267783 | 136
Alphaproteobacteria | 94 | 347287 | 121
Betaproteobacteria | 56 | 266362 | 311
Gammaproteobacteria | 126 | 483632 | 118
Deltaproteobacteria | 25 | 102115 | 206
Epsilonproteobacteria | 18 | 33416 | 455
Bacteroides | 25 | 71531 | 286
Chlamydiae | 13 | 13823 | 560
Chloroflexi | 10 | 33577 | 323
Cyanobacteria | 36 | 124080 | 590
Firmicutes | 106 | 312309 | 87
Spirochaetes | 18 | 38832 | 176
Thermi | 5 | 14160 | 974
Thermotogae | 9 | 17037 | 684
Wu D, Jospin G, Eisen JA (2013) Systematic Identification of Gene Families
for Use as “Markers” for Phylogenetic and Phylogeny-Driven Ecological
Studies of Bacteria and Archaea and Their Major Subgroups. PLoS ONE
8(10): e77033. doi:10.1371/journal.pone.0077033
45. Edge PCA: Identify
lineages that explain most
variation among samples
Edge PCA - Matsen and Evans 2013
Output: Edge PCA
46. QIIME Phylotyping and Phylogenetic Ecology
Lesson 7: Don’t Accept When You Are Defeated
52. • Leveraging an understanding of the
evolution of function to better predict
functions
Function & Phylogeny
53. PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: METHOD flow chart worked through for EXAMPLE A and EXAMPLE B, using genes labeled 1A to 3A, 1B to 3B, and 1 to 6 from Species 1, Species 2, and Species 3, shown beside the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN) with duplication events marked; some inferred duplications and functions are flagged "Duplication?" or "Ambiguous".]
CHOOSE GENE(S) OF INTEREST
IDENTIFY HOMOLOGS
ALIGN SEQUENCES
CALCULATE GENE TREE
OVERLAY KNOWN FUNCTIONS ONTO TREE
INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
Based on Eisen, 1998 Genome Res 8: 163-167.
Phylogenomics
54. PHYLOGENETIC PREDICTION OF GENE FUNCTION
Lesson 9: If you invent your own omics word, you are stuck with it so use it for branding
62. Non-Homology Predictions:
Phylogenetic Profiling
• Step 1: Search all genes in
organisms of interest against all
other genomes
• Ask: Yes or No, is each gene
found in each other species?
• Cluster genes by distribution
patterns (profiles)
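The three steps above can be sketched with a small presence/absence table. This is a minimal illustration under stated assumptions: the homology search itself is not shown (its result is assumed to be a mapping from each gene to the set of genomes where a homolog was found), the function names are hypothetical, and genes are grouped only when their profiles match exactly, whereas a real analysis would usually cluster by profile similarity.

```python
from collections import defaultdict


def build_profiles(hits: dict, genomes: list) -> dict:
    """Turn gene -> {genomes with a homolog} into yes/no profiles over all genomes."""
    return {
        gene: tuple(genome in found for genome in genomes)
        for gene, found in hits.items()
    }


def cluster_by_profile(profiles: dict) -> list:
    """Group genes that share an identical presence/absence profile."""
    groups = defaultdict(list)
    for gene, profile in profiles.items():
        groups[profile].append(gene)
    return [sorted(genes) for genes in groups.values()]


genomes = ["g1", "g2", "g3"]
hits = {"geneA": {"g1", "g3"}, "geneB": {"g1", "g3"}, "geneC": {"g2"}}
profiles = build_profiles(hits, genomes)
print(profiles["geneA"])
print(cluster_by_profile(profiles))
```

Genes that land in the same group (here `geneA` and `geneB`) are candidates for a shared function even though no sequence similarity between them is required.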
68. TIGR Tree of Life Project
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
69. Genomic Encyclopedia of Bacteria & Archaea
Wu et al. 2009 Nature 462, 1056-1060
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
70. Genomic Encyclopedia of Bacteria & Archaea
Wu et al. 2009 Nature 462, 1056-1060
Figure from Barton, Eisen et al. “Evolution”, CSHL Press based on Baldauf et al Tree
72. GEBA Cyanobacteria
Shih et al. 2013. PNAS 10.1073/pnas.1217107110
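[Figure, panels A and B: phylogenetic tree (scale bar 0.3) with clades labeled A, B1, B2, B3, C1, C2, C3, D, E, F, G and plastid lineages Paulinella, Glaucophyte, Green, Red, Chromalveolates. The figure caption is truncated in the extraction.]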
78. [Figure: tree of BACTERIA and ARCHAEA with labeled lineages, surrounded by diagrams of unusual features found in specific lineages. Chemical structures and diagram labels are not recoverable from the extraction.]
Bacterial lineages: Chlorobi, Firmicutes, Tenericutes, Fusobacteria, Chrysiogenetes, Proteobacteria, Fibrobacteres, TG3, Spirochaetes, WWE1 (Cloacamonetes), ZB3, Deinococcus-Thermus, OP1 (Acetothermia), Bacteroidetes, TM7, GN02 (Gracilibacteria), SR1, BH1, OD1 (Parcubacteria), OP11 (Microgenomates); three further labels (TM…, MVP-…, WS…) lost characters in extraction
Archaeal lineages: Euryarchaeota, Micrarchaea, DSEG (Aenigmarchaea), Nanohaloarchaea, Nanoarchaea, Cren MCG, Thaumarchaeota, Cren C2, Aigarchaeota, Cren pISA7, Cren Thermoprotei, Korarchaeota, pMC2A384 (Diapherotrites)
Feature diagrams: archaeal toxins (Nanoarchaea); lytic murein transglycosylase; stringent response (Diapherotrites, Nanoarchaea); sigma factor (Diapherotrites, Nanoarchaea); oxidoreductase; HGT from Eukaryotes (Nanoarchaea); murein (peptido-glycan); archaeal type purine synthesis (Microgenomates)
85. Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
Phylogenetic Binning
86. HiC Crosslinking Sequencing
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore
RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-
level deconvolution of a synthetic metagenome by
sequencing proximity ligation products. PeerJ 2:e415
http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the
synthetic microbial community are shown before and after filtering, along with the percent of total
constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon,
species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome
2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus,
K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence | Alignment | % of Total | Filtered | % of aligned | Length | GC | #R.S.
Lac0 | 10,603,204 | 26.17% | 10,269,562 | 96.85% | 2,291,220 | 0.462 | 629
Lac1 | 145,718 | 0.36% | 145,478 | 99.84% | 13,413 | 0.386 | 3
Lac2 | 691,723 | 1.71% | 665,825 | 96.26% | 35,595 | 0.385 | 16
Lac | 11,440,645 | 28.23% | 11,080,865 | 96.86% | 2,340,228 | 0.46 | 648
Ped | 2,084,595 | 5.14% | 2,022,870 | 97.04% | 1,832,387 | 0.373 | 863
BL21 | 12,882,177 | 31.79% | 2,676,458 | 20.78% | 4,558,953 | 0.508 | 508
K12 | 9,693,726 | 23.92% | 1,218,281 | 12.57% | 4,686,137 | 0.507 | 568
E. coli | 22,575,903 | 55.71% | 3,894,739 | 17.25% | 9,245,090 | 0.51 | 1076
Bur1 | 1,886,054 | 4.65% | 1,797,745 | 95.32% | 2,914,771 | 0.68 | 144
Bur2 | 2,536,569 | 6.26% | 2,464,534 | 97.16% | 3,809,201 | 0.672 | 225
Bur | 4,422,623 | 10.91% | 4,262,279 | 96.37% | 6,723,972 | 0.68 | 369
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is
shown for read pairs mapping to each chromosome. For each read pair the minimum path length on
the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded.
The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin
was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and
plotted.
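The procedure in the Figure 1 caption above (minimum path length on a circular chromosome, discarding pairs closer than 1000 bp, 100 equal bins over 2.5 Mb, normalizing to sum to 1) can be sketched as below. This is an outline under stated assumptions: read pairs are given as hypothetical `(position_a, position_b)` tuples on one chromosome, the function names are illustrative, and pairs at or beyond the 2.5 Mb range are simply skipped.

```python
def circular_distance(pos_a: int, pos_b: int, chrom_len: int) -> int:
    """Shortest path between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % chrom_len
    return min(d, chrom_len - d)


def insert_histogram(pairs, chrom_len, min_dist=1000, max_dist=2_500_000, n_bins=100):
    """Normalized histogram of read-pair distances for one chromosome."""
    counts = [0] * n_bins
    bin_width = max_dist / n_bins
    for pos_a, pos_b in pairs:
        d = circular_distance(pos_a, pos_b, chrom_len)
        if d < min_dist or d >= max_dist:
            continue  # too close, or outside the plotted range
        counts[int(d // bin_width)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


k12_len = 4_686_137  # E. coli K12 length, as listed in Table 1
pairs = [(0, 500), (0, 30_000), (100, 4_686_037), (0, 4_680_000), (0, 60_000)]
hist = insert_histogram(pairs, k12_len)
print(hist[:3])
```

Using the circular distance rather than the plain coordinate difference matters for pairs that straddle the linearization point, such as the pair at coordinates 0 and 4,680,000, which is only 6,137 bp apart on the circle.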
E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1;
Lieberman-Aiden et al., 2009). We observed a minor depletion of alignments spanning
the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137)
due to edge effects induced by BWA treating the sequence as a linear chromosome rather
than circular.
Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs
associating each genomic replicon in the synthetic community is shown as a heat map (see color scale,
blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome
1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2:
L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
reference assemblies of the members of our synthetic microbial community with the same
alignment parameters as were used in the top ranked clustering (described above). We first
Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges
depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof
depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend)
with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded.
Contig associations were normalized for variation in contig size.
typically represent the reads and variant sites as a variant graph wherein variant sites are
represented as nodes, and sequence reads define edges between variant sites observed in
the same read (or read pair). We reasoned that variant graphs constructed from Hi-C
data would have much greater connectivity (where connectivity is defined as the mean
path length between randomly sampled variant positions) than graphs constructed from
mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such
Figure 4 Hi-C contact maps for replicons of Lactobacillus brevis. Contact maps show the number of
Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, (A),
Chris Beitel
@datscimed
Aaron Darling
@koadman
87. Sequence Isn’t Everything
PB-PSB1
(Purple sulfur bacteria)
PB-SRB1
(Sulfate reducing bacteria)
(sulfate)
(sulfide)
Wilbanks, E.G. et al (2014). Environmental Microbiology
Lizzy Wilbanks
@lizzywilbanks
88. 12C, 12C14N, 32S
Biomass
(RGB composite)
0.044 0.080
34S-incorporation
(34S/32S ratio)
Wilbanks, E.G. et al (2014). Environmental Microbiology
Transfer of 34S from SRB to PSB
89. Long Reads Help, A Lot
Hiseq Miseq
100-250 bp
Moleculo
2-20 kb
Pacbio RSII
2-20kb
Micky Kertesz,
Tim Blauwcamp
Meredith Ashby
Cheryl Heiner
“Illumina-based
synthetic long reads”
Real-time single molecule
sequencing
(p4-c2, p5-c3)
295 Megabases 474 Megabases 61 Gigabases