An overview of my bioinformatics activities during my PhD and postdoctoral work, covering research published up to 2015:
Section 1: Introduction to Metagenomics
Section 2: Taxonomic annotation algorithm. Paper: https://doi.org/10.1093/bioinformatics/btq649
Section 3: Metagenomics to Retrieve a Bacteria. Paper: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-7
Section 4: Comparative Genomics for Antibiotic Resistance. Paper: https://doi.org/10.1371/journal.pbio.1002104
Appendix: Construction of a tailored reference. Paper: https://doi.org/10.1371/journal.pbio.1002104
Metagenomic Data Analysis: Computational Methods and Applications, by Fabio Gori
The document discusses computational methods for analyzing metagenomic data. It introduces metagenomics and the challenge of annotating short DNA sequences from microbial communities. It then describes two key approaches for metagenomic annotation: taxonomic-annotation algorithms that assign sequences to taxa based on similarity, and using genomic signatures to represent sequences as feature vectors for classification. The document focuses on different genomic signature representations for binning sequences in metagenomic data analysis.
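As a rough illustration of the genomic-signature idea summarized above, the sketch below turns a DNA sequence into a vector of normalized k-mer frequencies. The function name `kmer_signature`, the default `k=4`, and the choice to ignore k-mers containing ambiguous bases are assumptions made for this example; the summarized presentation may use a different signature definition.

```python
from collections import Counter
from itertools import product


def kmer_signature(sequence, k=4):
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Hypothetical helper for illustration: one feature per possible k-mer
    over the alphabet A, C, G, T.
    """
    sequence = sequence.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers made of unambiguous bases contribute to the total,
    # so windows containing N (or other symbols) are skipped.
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[kmer] / total for kmer in all_kmers]


signature = kmer_signature("ACGTACGTAC", k=2)
print(len(signature))  # one entry per possible 2-mer
```

Vectors of this kind can then be passed to any standard clustering or classification method, which is the general role a signature plays in binning.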
The document discusses metagenomics analysis tools and challenges. It summarizes several metagenome analysis portals that provide computational analysis and public sample databases. It also discusses the rapid growth of metagenomic data being produced, challenges around quality control, feature identification, characterization and presentation of metagenomic data, and the need for standardized metadata and data formats. The future directions highlighted include studying strain variation, expanding metadata capture and standards, and developing improved assembly, binning and analysis methods.
This document provides an overview of a tutorial on analyzing microbiome data using 16S rRNA gene sequencing and metagenomics. The morning session covers the basics of 16S analysis including sample collection, PCR amplification of the 16S gene, clustering sequences into OTUs, assigning taxonomy, and calculating alpha and beta diversity. The assumptions and limitations of 16S analysis are also discussed. The afternoon session introduces metagenomics and compares it to 16S analysis. It covers taxonomic and functional profiling from metagenomic data as well as tools like PICRUSt for predicting gene functions. The document concludes by discussing the value of multi-omics approaches that integrate different types of microbiome data.
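To make the alpha and beta diversity step concrete, here is a minimal sketch that computes the Shannon index for one sample and the Bray-Curtis dissimilarity between two samples from OTU count vectors. These two metrics are chosen as common examples; the tutorial itself may cover others, and the function names are illustrative.

```python
import math


def shannon_index(counts):
    """Alpha diversity: Shannon index (natural log) of one sample."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def bray_curtis(sample_a, sample_b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples.

    Both samples must list counts for the same OTUs in the same order.
    """
    denominator = sum(sample_a) + sum(sample_b)
    if denominator == 0:
        return 0.0
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    return numerator / denominator


gut = [10, 10, 0]
soil = [0, 0, 20]
print(shannon_index(gut), bray_curtis(gut, soil))
```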
Meren's pirate presentation at the STAMPS course on the basic concepts most binning algorithms use to assign contigs to genome bins: sequence composition and differential coverage.
This document discusses NIST's work in developing genomic reference materials and methods to evaluate microbial genomics measurements. It describes three projects: 1) assessing genomic purity by detecting low levels of contaminants using sequencing and classification, 2) evaluating SNP calling methods using reference materials and replicates to establish confidence, and 3) developing characterized genomic reference materials for public health pathogens. The overall aim is to build an infrastructure to support genome-based characterization of microbial samples.
This document provides an overview of metagenomic analysis. It discusses collecting metagenomic data through sampling and sequencing environmental samples. It then covers various bioinformatics approaches used in metagenomic analysis such as assembly, binning, and annotation of sequencing data. Specific tools and algorithms for these approaches are also described, including reference-based and de novo assembly, and compositional and similarity-based binning methods such as AbundanceBin.
Microbiome studies using 16S ribosomal DNA PCR: some cautionary tales, by jennomics
Presentation at a workshop conducted by the UC Davis Bioinformatics Core Facility: Using the Linux Command Line for Analysis of High Throughput Sequence Data, September 15-19, 2014
Introduction to 16S rRNA gene multivariate analysis, by Josh Neufeld
Short introductory talk on multivariate statistics for 16S rRNA gene analysis given at the 2nd Soil Metagenomics conference in Braunschweig Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.
Introduction to 16S Analysis with NGS - BMR Genomics, by Andrea Telatin
This document provides an overview and primer on 16S amplicon sequencing and analysis for metagenomics. It discusses how 16S is a ubiquitous gene that can be used to compare microbial communities across samples, outlines common analysis steps like preprocessing, OTU picking, taxonomy assignment, and diversity metrics, and introduces two analysis tools: MEGAN and QIIME. Key advantages and limitations of the 16S amplicon approach are highlighted.
Formal languages to map Genotype to Phenotype in Natural Genomes, by madalladam
The document discusses using formal language theory to model genotype to phenotype (G2P) mappings. It proposes that G2P mappings are non-linear networks rather than linear pathways, and that formal languages could be used to formally represent these networks. Specifically, it suggests using concepts from computational linguistics like context-free grammars, attribute grammars, and semantic actions to parse genetic sequences and compute their phenotypic outcomes. As an example, it presents a context-free grammar for designing genetic constructs and computing their chemical dynamics using an attribute grammar. In summary, formal languages may provide a way to rigorously define the complex non-linear relationships between genotypes and resulting phenotypes.
This document discusses DNA sequencing and phylogenetic analysis. It defines DNA sequencing as determining the order of nucleotide bases in a DNA molecule. It describes several DNA sequencing techniques like Sanger sequencing and nanopore sequencing. It explains how DNA sequencing results are used to infer phylogenetic relationships and construct phylogenetic trees showing evolutionary relationships among species. It discusses applications of DNA sequencing and phylogenetic analysis in fields like medicine, forensics, and tracing pathogen evolution.
The presentation opens with background on big data, mainly metagenomic data, and discusses the hurdles of analyzing it with conventional approaches. The second part briefly introduces machine learning approaches, with a biological example for each. The final part presents work focused on implementing the Random Forest machine learning approach for the functional annotation and taxonomic classification of metagenomic data.
Human genetic variation and its contribution to complex traits, by groovescience
This document summarizes a meeting discussing the human genome and human genetic variation.
The key points are:
1) In June 2000, the first draft of the human genome was announced, with the sequence publications following in 2001. This marked a major milestone in understanding the human genome.
2) Since then, over 11 million SNPs and thousands of structural variants have been discovered through projects like HapMap and large sequencing studies. However, the majority of rare variants are still unknown.
3) Genome-wide association studies since 2006 have identified over 300 genetic loci associated with over 80 human traits and diseases. However, determining the functional consequences and molecular mechanisms remains a major challenge.
This document provides an overview of the Lab for Bioinformatics and Computational Genomics at a university. It describes that the lab has over 100 people from diverse backgrounds including engineers, scientists, technicians, geneticists and clinicians. The lab's work involves hardware/software engineering, mathematics, molecular biology and analysis of biological data through computing. Bioinformatics is defined as the application of information technology to biological data, including tasks like sequence analysis, molecular modeling, phylogeny analysis, medical applications and more. The document then discusses some of the promises and applications of genomics and bioinformatics in fields like medicine, agriculture and animal health.
Shotgun metagenomics involves collecting environmental samples, extracting DNA from the samples, sequencing the DNA using shotgun sequencing, and then analyzing the sequence data computationally. Key steps include assembling reads into longer contigs to aid analysis and annotation. While assembly works well for some datasets, challenges include repeats, low coverage of low-abundance species, and strain variation. High coverage, often 10x or more per genome, is critical for robust assembly. The amount of sequencing needed can be substantial, such as terabases of data to deeply sample microbial communities.
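The coverage arithmetic behind that last point can be sketched as follows. The example assumes that a genome's share of the sequenced bases equals its relative abundance in the extracted DNA, which is a simplification; the function names and the numbers are illustrative only.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def bases_needed(target_coverage, genome_size, relative_abundance):
    """Total bases to sequence so that one community member reaches
    the target coverage.

    Assumption: the member receives a fraction of all sequenced bases
    equal to its relative abundance (as a fraction of community DNA).
    """
    return target_coverage * genome_size / relative_abundance


# One million 100 bp reads from a single 5 Mbp genome.
print(expected_coverage(1_000_000, 100, 5_000_000))

# 10x coverage of a 5 Mbp genome that makes up 0.1% of the community DNA.
print(bases_needed(10, 5_000_000, 0.001))
```

Under these assumptions, a genome at 0.1% abundance already requires tens of gigabases for 10x coverage, which is why rarer members push the requirement toward terabases.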
This document discusses various topics related to drug discovery through bioinformatics. It begins by describing how genome-wide RNAi screening in the nematode C. elegans can be used to identify genes involved in biological pathways related to diseases like type-2 diabetes. It then discusses topics like structural genomics, target identification and validation, high-throughput screening approaches and facilities, sources for screening libraries, criteria for hit and lead compounds, and computational methods used in hit identification and optimization like pharmacophore modeling and evaluating compounds against the "rule of five". Descriptors that can be used for characterizing compounds are also listed.
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences, by Surya Saha
Presented at the Cornell Symbiosis symposium. Workflows for processing amplicon-based 16S/ITS sequences as well as whole genome shotgun sequences are described. Slides include a short description and links for each tool.
DISCLAIMER: This is a small subset of tools out there. No disrespect to methods not mentioned.
The document describes the Lab for Bioinformatics and Computational Genomics at UGent. The lab has over 100 staff working on bioinformatics projects, including engineers, scientists, and clinicians. The lab's work involves applying computational methods and software tools to analyze biological data and address problems in areas like genomics, molecular modeling, phylogeny, and medical applications.
This document provides an overview of genomics and metagenomics. It begins with an introduction to genomics, describing genome assembly, validation, and metabolic reconstruction. It then covers metagenomics, discussing its history, pitfalls, and potentials. Key points include that genomics analyzes the parts list of a single genome, while metagenomics analyzes the collective genomes of an entire microbial community. Metagenomics has been used to explore novel sequences from various environments, perform comparative analyses between ecosystems, and extract genomes from low-abundance species.
Microbial Metagenomics Drives a New Cyberinfrastructure, by Larry Smarr
Invited talk at the School of Biological Sciences, University of California, Irvine (Irvine, CA), 06.03.03.
The document describes a lab for bioinformatics and computational genomics at Ghent University. It has over 100 people including engineers, mathematicians, and molecular biologists. The lab uses bioinformatics approaches like sequence analysis, datamining, and computational biology to analyze large genomic datasets. One goal is developing an app for personal genomic analysis and interpretation.
Whole genome taxonomic classification for prokaryotic plant pathogens, by Leighton Pritchard
This document summarizes a presentation on using whole genome sequencing and average nucleotide identity (ANI) for taxonomic classification of prokaryotic plant pathogens. It discusses how classification is critical for legislation and diagnostics but current bacterial taxonomy is problematic. It presents results of ANI analyses of 34 Dickeya species and 55 Pectobacterium species which identified nine Dickeya groups including three misclassified species, and ten Pectobacterium groups with P. carotovorum and P. wasabiae each splitting into multiple potential species. Whole genome analyses are cracking the taxonomy which will have real-world consequences.
Dr. Melaku Gedil presented on genotyping in breeding programs at the Implementation of Crop Improvement Strategy of IITA. The presentation discussed strategies for crop breeding including recombining genes among genotypes and selecting superior genotypes. It also discussed marker assisted selection (MAS) and its advantages such as enabling selection at the seedling stage and accelerating line development. Key issues in implementing MAS included the need for genomic resources, cost-effective genotyping systems, high-throughput phenotyping, and accurate marker-trait association methods.
The document discusses pairwise sequence alignment, which involves mapping residues between two sequences to find conserved regions and maximize similarity score. It describes how alignment is used to infer homology and related applications. Key concepts covered include substitution matrices like PAM and BLOSUM that account for amino acid substitutability, and dot plots which provide a graphical representation of sequence similarity.
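As a small worked instance of maximizing a similarity score, the sketch below computes the optimal global alignment score of two sequences with Needleman-Wunsch dynamic programming. To stay short it swaps in a flat match/mismatch score and a linear gap penalty in place of a PAM or BLOSUM substitution matrix; those scoring values and the function name are assumptions for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gaps).

    Only the score is returned; keeping two rows of the dynamic
    programming table is enough when no traceback is needed.
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Replacing the flat `match`/`mismatch` values with a lookup into a substitution matrix is the usual way to adapt this to protein sequences.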
Bioinformatics emerged from the marriage of computer science and molecular biology to analyze massive amounts of biological data, like that produced by the Human Genome Project. It uses algorithms and techniques from computer science to solve problems in molecular biology, like comparing genomic sequences to understand evolution. As genomic data exploded publicly, bioinformatics was needed to efficiently store, analyze, and make sense of this information, which has applications in molecular medicine, drug development, agriculture, and more.
Major characteristics used in taxonomy include classical characteristics like morphology and physiology, as well as molecular characteristics like comparing protein and nucleic acid sequences. Nucleic acid composition looks at G-C content, hybridization compares sequence homology, and sequencing directly compares genomes. Phylogenetic trees show evolutionary relationships, and different cellular components like rRNA can serve as molecular chronometers indicating relatedness. Domains and kingdoms are two systems for classifying organisms based on rRNA and other studies.
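The G-C content mentioned above is simple to compute from a sequence. This is a minimal sketch; the function name and the decision to ignore ambiguous bases such as N are assumptions for the example.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    total = gc + at
    # Sequences with no A/C/G/T at all get 0.0 rather than an error.
    return gc / total if total else 0.0


print(gc_content("GGCCAATT"))
```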
The document discusses using genomic context analysis and high-throughput data to construct and interpret networks of functional associations between genes and proteins. It describes the STRING database, which uses genomic context evidence from 110 species to predict functional links. It also discusses integrating various high-throughput data types, like protein-protein interaction data and gene expression data from microarrays, to improve the coverage and accuracy of predicted functional associations in STRING. Normalization methods and singular value decomposition are used to analyze and combine expression data from multiple experiments.
K-mers in metagenomics
K-mers play a critical role in the exploration of metagenomic data. They have been widely used to assign taxonomic attributions to the short genomic fragments characteristic of shotgun (metagenomic) sequencing. These approaches provide an assembly-free method for profiling microbial communities, and have helped elucidate the factors driving microbial community composition across biogeochemical gradients. Advances in sequencing technology are now making it cost-effective to sequence microbial communities at sufficient depths to allow for the assembly of high-quality contigs. This has made it possible to adopt k-mer based approaches to enable reliable binning of contigs originating from a single microbial population within a community. In this session, I will present both an overview of how k-mers can be used to assign taxonomic attributions to short metagenomic reads, and discuss how these approaches have advanced to a point where population genomes can be recovered en masse from even complex microbial communities.
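One simple way to picture assembly-free taxonomic attribution is exact k-mer matching: a read is assigned to the reference with which it shares the most k-mers. The sketch below is a toy version of that idea, not the method of any particular tool; the reference sequences, taxon labels, function names, and `k=4` are made up for illustration (real classifiers use much longer k-mers and indexed databases).

```python
def kmer_set(sequence, k):
    """Set of all k-mers occurring in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def classify_read(read, references, k=4):
    """Return the taxon sharing the most k-mers with the read, or None.

    `references` maps a taxon label to a reference sequence.
    """
    read_kmers = kmer_set(read, k)
    scores = {
        taxon: len(read_kmers & kmer_set(reference, k))
        for taxon, reference in references.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical toy references.
toy_references = {
    "taxon_a": "AAAAACCCCCAAAAACCCCC",
    "taxon_b": "GGGGGTTTTTGGGGGTTTTT",
}
print(classify_read("AACCCC", toy_references))
```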
Correcting bias and variation in small RNA sequencing for optimal (microRNA) ..., by Christos Argyropoulos
Presentation given about the Generalized Additive Model Location, Scale and Shape (GAMLSS) methodology for the analysis of small RNA sequencing data and the potential of microRNAs as biomarkers for kidney and cardiometabolic diseases
Visual Exploration of Clinical and Genomic Data for Patient Stratification, by Nils Gehlenborg
Talk presented at the Simons Foundation Biotech Symposium "Complex Data Visualization: Approach and Application" (12 September 2014)
http://www.simonsfoundation.org/event/complex-data-visualization-approach-and-application/
In this talk I describe how we integrated a sophisticated computational framework directly into the StratomeX visualization technique to enable rapid exploration of tens of thousands of stratifications in cancer genomics data, creating a unique and powerful tool for the identification and characterization of tumor subtypes. The tool can handle a wide range of genomic and clinical data types for cohorts with hundreds of patients. StratomeX also provides direct access to comprehensive data sets generated by The Cancer Genome Atlas Firehose analysis pipeline.
http://stratomex.caleydo.org
The document describes a study that used bioinformatics tools to identify and analyze LTR retrotransposons in the chimpanzee genome. A total of 2056 LTR retrotransposon hits were identified using the program LTR_STRUC. Of these, 97 contained a reverse transcriptase domain. Phylogenetic analysis was performed on the 23 elements containing conserved reverse transcriptase motifs to classify the elements into classes.
This document summarizes a new technique and Python package called TCRpower for quantifying the detection power of T-cell receptor sequencing methods using spike-in standards. TCRpower uses a negative binomial model to estimate detection probabilities of target T-cell receptors based on sequencing read counts. It calibrates this model using spike-in controls containing known T-cell receptor sequences added at defined concentrations. Results from applying TCRpower to PCR-based T-cell receptor sequencing data show it can reliably detect clonotypes down to a frequency of 10^-6 but has higher variability for rarer clonotypes below 300 per million RNA. TCRpower improves method selection, optimization and reproducibility for T-cell receptor sequencing.
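To show how a negative binomial model yields a detection probability, the sketch below computes the chance of observing at least one read for a receptor whose read count has mean `mean_count` and dispersion (size) parameter `size`. This is a generic textbook calculation under an assumed mean/size parameterization and an assumed "at least one read" detection rule; it does not use or reproduce the TCRpower API.

```python
import math


def detection_probability(mean_count, size):
    """P(count >= 1) under a negative binomial read-count model.

    With mean m and size r, P(count = 0) = (r / (r + m)) ** r,
    computed here as exp(-r * log1p(m / r)) for numerical stability.
    """
    if mean_count <= 0:
        return 0.0
    probability_zero = math.exp(-size * math.log1p(mean_count / size))
    return 1.0 - probability_zero


# Expected 1 read on average, strong overdispersion (size = 1).
print(detection_probability(1.0, 1.0))
```

As `size` grows, the model approaches a Poisson distribution and the result tends to `1 - exp(-mean_count)`, so smaller `size` values (more overdispersion) lower the detection probability at the same mean.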
Contents:
What does sequence mean?
Examples of sequences
Sequence Homology
Sequence Alignment
What is the use of sequence alignment?
Alignment methods
Tools for Sequence Alignment
FASTA Format
BLAST
Principle of BLAST
Variants of BLAST Program
BLAST input
BLAST output
Multiple sequence alignment
What is the use of multiple alignments?
Multiple Alignment Method
Tool for multiple alignments
ClustalW input
ClustalW output
E (Expectation) value
Demerits of progressive alignment
The document discusses using SNPs (single nucleotide polymorphisms) to help identify candidate genes associated with quantitative traits. It presents SNPit, a database that integrates data from Ensembl, dbSNP and Perlegen to rank SNPs based on differences between resistant and susceptible mouse strains. SNPit supports exploratory analysis of large genomic regions to help focus candidate gene searches for traits like disease susceptibility. The goal is to complement existing methods and automate parts of the process to accelerate disease gene identification.
Rapid and accurate Cancer somatic mutation profiling with the qBiomarker Soma...QIAGEN
QIAGEN has developed real-time PCR-based qBiomarker Somatic Mutation PCR Arrays for pathway- and disease-focused mutation profiling. By combining allele-specific amplification and 5' hydrolysis probe detection, the PCR assays on these arrays detect as little as 0.01% somatic mutation in a background of wild-type genomic DNA. These assays have consistent and reliable mutation detection performance in different sample types (including fresh, frozen, or formalin-fixed samples), and with varying sample quality. In application examples, the PCR-based mutation detection results are consistent with Pyrosequencing results for the same samples. The qBiomarker Somatic Mutation PCR Arrays, combining laboratory-verified assays, comprehensive content, and integrated data analysis software, are highly suited for identifying somatic mutations in the context of biological pathways and diseases.
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...Natalio Krasnogor
These slides are part of a presentation I gave on March 2010 at the BioInformatics and Genome Research Open Club at the Weizmann Institute of Science, Israel.
In these slides my student and I describe two web-applications for microarray and gene/protein set analysis,
ArrayMining.net and TopoGSA. These use ensemble and consensus methods as well as the
possibility of modular combinations of different analysis techniques for an integrative view of
(microarray-based) gene sets, interlinking transcriptomics with proteomics data sources. This integrative process uses tools from different fields, e.g. statistics, optimisation and network
topological studies. As an example for these integrative techniques, we use a microarray
consensus-clustering approach based on Simulated Annealing, which is part of the ArrayMining.net
Class Discovery Analysis module, and show how this approach can be combined in a modular
fashion with a prior gene set analysis. The results reveal that improved cluster validity indices can be obtained by merging the two methods, and provide pointers to distinct sub-classes within pre-defined tumour categories for a breast cancer dataset by the Nottingham Queens Medical Centre.
In the second part of the talk, I show how results from a supervised
microarray feature selection analysis on ArrayMining.net can be investigated in further detail with
TopoGSA, a new web-tool for network topological analysis of gene/protein sets mapped on a
comprehensive human protein-protein interaction network. I discuss results from a TopoGSA
analysis of the complete set of genes currently known to be mutated in cancer.
This document summarizes the characterization of 10 epitope-specific CD8+ T cell receptor repertoires from over 4,600 single cells. Key findings include quantifying gene segment usage, epitope selection, TCR similarity using TCRdist, repertoire diversity with TCRdiv, and developing a distance-based classifier to assign unobserved TCR. The work demonstrates that predictive models for TCR-pMHC recognition may be possible despite tremendous TCR diversity, with potential applications to analyze clinical T cell receptor repertoire data.
Usual Questions with Unusual Answers: Application of Multi-class Supervised A...Data Con LA
Data Con LA 2020
Description
In the field of machine learning, it is well known that supervised problems can be one of two categories: classification or regression. Within the context of classification, several metrics and graphs used to assess the performance of a model only work in the context of a classification problem that computes the decision boundary between two classes (binary classification). With a greater adoption of machine learning, organizations now find themselves determining decision boundaries between several classes (multiclass). The usual question that arises is, how can one set up a multi-class problem and assess its performance? Although expansions on binary performance metrics do exist for this situation, there are a number of challenges worth considering. Suffering from limitations such as insufficient data samples and class imbalance, multi-class experiments can be unreliable for several machine learning problems. Developing a work-around, we compare and contrast several approaches to re-designing a multi-classification into binary classification. We further elucidate the best experimental design for assessing the final decisions of our model (s). The experiments for this case study analysis are applied to determine the taxonomic levels of several COVID-19 viral genomes to identify the pathogenic strains based on digital signal and chaos-inspired features.
Talk Main Points:
*What is multi-class classification?
*Compare and contrast the performance of multi-class and binary class problems
*Transforming a multi-class problem into a binary class problem
*Assessing limitations of each transformation approach in the process of COVID-19 viral taxonomy classification
Speaker
Rishov Chatterjee, City of Hope, Data Scientist
A computational framework for large-scale analysis of TCRβ immune repertoire ...Thermo Fisher Scientific
TCRβ immune repertoire analysis by next-generation sequencing is emerging as a valuable tool for research studies of the tumor microenvironment and potential immune responses to cancer immunotherapy. Generation of insight from immune repertoire profiling often requires comparative analysis of immune repertoires across research sample cohorts representing immune responses to defined antigens or immunomodulatory agents. Here we describe the development of a computational framework enabling large-scale comparative analysis of immune repertoire data on cloud-based infrastructure.
Microarrays allow researchers to analyze gene expression across thousands of genes simultaneously. DNA probes are arrayed on a small glass or nylon slide, and labeled mRNA from samples is hybridized to the probes. Fluorescent scanning detects which genes are expressed. Data analysis includes normalization, distance metrics, clustering, and visualization to group genes with similar expression profiles and identify patterns of co-regulated genes. Microarrays enable functional genomics studies of development, disease, response to drugs or environmental factors, and more.
This document discusses various methods for normalization of RNA-seq read count data, including RPKM/FPKM, TMM, Limma voom, and TPM. It provides explanations of each method and how they aim to correct for differences in sequencing depth, transcript length, and transcript pool composition between samples. The document also provides examples of calculating RPKM, TPM, and comparing the two methods. Lastly, it discusses using tools like RSEM, Bowtie, and EBSeq for determining differentially expressed genes from RNA-seq data through a count-based strategy.
Metagenomic Data Analysis and Microbial Genomics
1. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Metagenomic Data Analysis and Microbial Genomics
Fabio Gori
Intelligent Systems, Institute for Computing and Information Sciences
in collaboration with
Department of Microbiology
Radboud University Nijmegen
The Netherlands
22nd May 2015
Table of Contents
1 Introduction to Metagenomics
2 Taxonomic-annotation Algorithms
3 Metagenomics to Retrieve a Bacteria
4 Comparative Genomics for Antibiotic Resistance
5 Appendix: construction of a tailored reference
1 Introduction to Metagenomics
What is Metagenomics?
Metagenomics: the study of genomic information obtained directly from microbial communities
Why?
99% of microbes cannot be sequenced individually (they cannot be cultured in the lab)
Understand interactions between organisms
Human microbiota
What kind of data? A meta... jigsaw puzzle
Reads of multiple microbes
Original pictures are unknown
Pieces are similar
Biased abundance of pieces
Annotation: discovering the original pictures of the puzzles
Assign each read to an organism or to a taxonomic identifier
Taxonomy: a biological classification
Linnaean taxonomy:
Formal system for classifying and naming living things
Based on a simple hierarchical structure
Similar elements are grouped together
Rank: level in the hierarchy (left)
Taxon: unit of the hierarchy (group of similar living things)
2 Taxonomic-annotation Algorithms
Lowest Common Ancestor (LCA) Algorithm
For each read r of the metagenome:
1 Compare r with reference sequences (e.g. with BLASTX)
2 Assign r to the lowest common taxonomic ancestor of the matching species Hi
Example (figure): taxonomy tree with the species H1 to H12 and the resulting LCA
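The LCA step can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the toy taxonomy `PARENT`, the helper `lineage` and the function name `lowest_common_ancestor` are all assumptions introduced here, and the list of hits stands in for the taxa of the BLASTX matches of one read.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child taxon -> parent taxon ("root" has no parent).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Planctomycetes": "Bacteria",
    "Proteobacteria": "Bacteria",
    "Brocadia": "Planctomycetes",
    "Kuenenia": "Planctomycetes",
    "Escherichia": "Proteobacteria",
}


def lineage(taxon: str) -> List[str]:
    """Return the path from the root down to `taxon`."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def lowest_common_ancestor(hits: List[str]) -> Optional[str]:
    """LCA of the taxa matched by one read; None if the read has no hits."""
    if not hits:
        return None  # the read stays unassigned
    paths = [lineage(h) for h in hits]
    lca: Optional[str] = None
    # Walk down from the root for as long as all lineages agree.
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca
```

A read whose hits are Brocadia and Kuenenia is placed at their shared phylum, while a read that also hits Escherichia moves up to Bacteria. This shows why LCA leaves few reads at low ranks.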
LCA: Pros and Cons
Pros:
Higher accuracy than BLASTX best hit
Assigning reads to taxa is more realistic (with short reads)
Cons:
Few reads at low ranks
Many unassigned reads
How can we improve it?
MTR: Multiple Taxonomic Rank based clustering
Goal: taxonomic annotation of short metagenomic reads (rank-level)
Assign from the highest rank to the lowest feasible rank
Assignments of reads are dependent on each other
MTR Algorithm scheme: top-down strategy
1 Compare reads R with reference proteins (we used BLASTX and the NCBI-NR database)
2 For each rank j (from the highest to the lowest):
    1 T ← {taxa at rank j of proteins matching R}
    2 Annotate by clustering R into clusters Ci; each Ci corresponds to a taxon ti ∈ T
    3 Remove from R the reads with incoherent classification (w.r.t. higher-rank classifications)
3 For each rank j (from the lowest to the highest):
    1 Majority vote on the clusters' intersections at rank j
    2 Make higher-rank classifications coherent with the majority-vote results
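The coherence check in step 2 can be illustrated with a small Python sketch. It assumes one plausible reading of "incoherent": a read's taxon at rank j is kept only if its parent equals the taxon the same read received at the next higher rank. The mapping `PARENT_OF`, the function `keep_coherent` and the read names are hypothetical, and the published MTR rule may differ in detail.

```python
from typing import Dict

# Hypothetical taxonomy fragment: genus -> family (the next higher rank).
PARENT_OF: Dict[str, str] = {
    "Escherichia": "Enterobacteriaceae",
    "Salmonella": "Enterobacteriaceae",
    "Bacillus": "Bacillaceae",
}


def keep_coherent(higher: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Keep rank-j assignments whose parent matches the read's higher-rank taxon."""
    return {
        read: taxon
        for read, taxon in current.items()
        if read in higher and PARENT_OF.get(taxon) == higher[read]
    }


# Assignments of three reads at rank Family, then at rank Genus.
family = {"r1": "Enterobacteriaceae", "r2": "Enterobacteriaceae", "r3": "Bacillaceae"}
genus = {"r1": "Escherichia", "r2": "Bacillus", "r3": "Bacillus"}

# r2 is dropped: Bacillus does not belong to Enterobacteriaceae.
coherent_genus = keep_coherent(family, genus)
```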
MTR: Annotation via combinatorial optimization
For each rank j, for each taxon ti of rank j:
Create cluster Ci ⊆ R of reads similar to taxon ti
Set Covering Problem
Select a collection of clusters (taxa) such that:
No sequence is left outside
The number of selected clusters is minimal
If Ci is selected, sequences of Ci will be assigned to ti
Example (figure): membership matrix of sequences s1 to s10 (rows) against clusters C1 to C6 (columns); s1, s4 and s7 belong to three clusters each, the other sequences to two
Clustering Solution (figure): the same matrix with the selected clusters marked
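Set covering is NP-hard, so it is usually solved approximately. The sketch below uses the standard greedy heuristic (repeatedly pick the cluster that covers the most still-uncovered sequences). The slides do not say which solver MTR uses, so the heuristic is an assumption, and the membership data is a made-up toy instance rather than the matrix from the figure.

```python
from typing import Dict, List, Set


def greedy_set_cover(clusters: Dict[str, Set[str]]) -> List[str]:
    """Pick clusters until every sequence is covered, largest gain first."""
    uncovered: Set[str] = set().union(*clusters.values())
    chosen: List[str] = []
    while uncovered:
        # Cluster covering the most uncovered sequences; ties go to the first name.
        best = max(sorted(clusters), key=lambda c: len(clusters[c] & uncovered))
        chosen.append(best)
        uncovered -= clusters[best]
    return chosen


# Hypothetical toy instance: cluster (taxon) -> sequences similar to it.
toy_clusters = {
    "C1": {"s1", "s2", "s3", "s4"},
    "C2": {"s1", "s5"},
    "C3": {"s4", "s5", "s6", "s7"},
    "C4": {"s2", "s6"},
    "C5": {"s7", "s8"},
    "C6": {"s3", "s8"},
}

selected = greedy_set_cover(toy_clusters)
```

On this toy instance three clusters suffice. A sequence that falls in several selected clusters (here s7) still needs a tie-breaking rule, which this sketch leaves open.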
MTR vs LCA: MTR better w.r.t. quantity
MTR annotates more reads than LCA
Simulated data: MTR annotates 8%-37% more reads
At rank Genus: 28%-89%
Real-life data: MTR annotates 15%-30% more reads
At rank Species: 120%-208%
Experiments: 12 simulated datasets and 3 real-life datasets (100 bp reads)
MTR vs LCA: LCA better accuracy
Accuracy (%) and number of reads assigned, for each rank
Rank      MTR (# of reads)    LCA (# of reads)
Kingdom   100.00 (166,948)    99.99 (155,263)
Phylum     99.86 (166,948)    99.93 (155,258)
Class      99.73 (166,936)    99.81 (141,829)
Order      97.67 (166,148)    98.14 (115,732)
Family     97.62 (165,231)    98.04 (110,488)
Genus      97.42 (140,476)    98.35 (110,139)
Table: dataset M3, coverage 4X, total reads: 166,978
Rank      MTR (# of reads)    LCA (# of reads)
Kingdom    95.07 (88,537)     94.66 (73,176)
Phylum     93.21 (88,537)     92.57 (73,169)
Class      89.25 (87,635)     88.98 (60,294)
Order      89.24 (85,657)     88.44 (57,373)
Family     77.35 (81,366)     81.84 (48,760)
Genus      61.36 (77,307)     74.60 (40,823)
Table: dataset M2, coverage 1X, total reads: 288,730
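The trade-off in these tables can be made concrete with a little arithmetic. The snippet below takes the genus-level read counts from the two tables and computes how many more reads MTR assigns relative to LCA. The helper name `relative_gain` is introduced here for illustration. The results, roughly 28% and 89%, appear to correspond to the two ends of the genus-level range quoted for simulated data.

```python
# Reads assigned at rank Genus, copied from the two tables: (MTR, LCA).
genus_reads = {"M3": (140_476, 110_139), "M2": (77_307, 40_823)}


def relative_gain(mtr: int, lca: int) -> float:
    """Percentage of additional reads assigned by MTR relative to LCA."""
    return 100.0 * (mtr - lca) / lca


gains = {name: round(relative_gain(*counts), 1) for name, counts in genus_reads.items()}
for name, gain in gains.items():
    print(f"{name}: MTR assigns {gain}% more reads than LCA at rank Genus")
```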
MTR vs LCA: MTR better population distribution
Population distributions (rank Genus) of M2, coverage 0.1X
Conclusions
MTR outperforms LCA in two ways:
More sequences annotated, especially at low ranks
Better estimate of population distribution
LCA tends to be more accurate
Future Developments
Replace BLASTX with a composition-based similarity measure
Additional constraints on cluster selection, e.g. consistent coverage depth on proteins or constraints on genome location coverage
3 Metagenomics to Retrieve a Bacteria
Metagenomic sequencing to acquire an organism
Candidatus Brocadia fulgida
Brocadia genome had not been previously sequenced
Sequencing platforms (mean read length):
Sanger Shotgun (800 bp)
Sanger Fosmid (800 bp)
454 GS20 (200 bp)
First standard annotation:
Reads are assigned to their BLASTX best hit
Reads are assigned to Brocadia if the best hit is Kuenenia (Kuenenia is a close relative of Brocadia)
Why do FISH analysis and BLASTX annotation not agree?
80% of the cells are Brocadia, but...
Brocadia seems underrepresented
Are we sure?
Can we still extract significant information?
                 Shotgun   Fosmid   454
Brocadia reads   9.68%     13.76%   12.92%
Brocadia bp      9.76%     14.33%   11.34%
Let's do some composition-based analyses...
Different point of view: GC content
[ Bernaola-Galvan et al., Gene, 2004 ]
Different organisms can have different GC content (16.6% - 74.9%)
If a genome is partitioned into equally sized, non-overlapping sequences:
GC content has an approximately normal distribution
The distribution is centered on the organism's GC content
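The partitioning described above is easy to sketch. The functions below compute the GC content of consecutive, equally sized, non-overlapping windows of a sequence. The function names and the toy sequence are assumptions introduced here. The window values of a real genome would be expected to cluster around the organism's overall GC content.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def window_gc(genome: str, size: int) -> List[float]:
    """GC content of consecutive, equally sized, non-overlapping windows."""
    # A trailing fragment shorter than `size` is dropped.
    return [
        gc_content(genome[i:i + size])
        for i in range(0, len(genome) - size + 1, size)
    ]


# Toy sequence: two balanced windows followed by one GC-only window.
toy_genome = "ATGC" * 50 + "GGCC" * 25
toy_windows = window_gc(toy_genome, 100)
```

For a metagenome, a histogram of such values over reads or contigs can hint at how many organisms with distinct GC content are present, independently of any BLASTX annotation.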
We saw that Brocadia is underrepresented...
How can we cope with that?
Sets of well-recovered Kuenenia ORFs differ
Figure: extended Venn diagram of Brocadia Open Reading Frames retrieved for 80% of their length, by technology: Shotgun (Sanger), Fosmid (Sanger), 454
Depth of coverage: correlation on the same ORF
Figure: for each pair of technologies (Shotgun-Fosmid, Shotgun-454, Fosmid-454), a bar from 0% to 100% split by correlation bin: (−1, −0.7), (−0.7, −0.3), (−0.3, 0), (0, 0.3), (0.3, 0.7), (0.7, 1). Legend: Similar, Different
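One plausible way to obtain such a correlation is sketched below: build the per-base depth of coverage of an ORF for each technology, then take the Pearson correlation of the two depth profiles. The slides do not state how the correlation was computed, so this is an assumption. The function names, the read coordinates and the half-open `(start, end)` convention are illustrative only.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def depth_profile(orf_len: int, reads: Sequence[Tuple[int, int]]) -> List[int]:
    """Per-base depth of an ORF from (start, end) alignments, end exclusive."""
    depth = [0] * orf_len
    for start, end in reads:
        for pos in range(max(start, 0), min(end, orf_len)):
            depth[pos] += 1
    return depth


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long profiles (NaN if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


# Hypothetical alignments of reads from two technologies on a 10 bp ORF.
shotgun = depth_profile(10, [(0, 6), (2, 8)])   # covers the start of the ORF
pyro454 = depth_profile(10, [(6, 10)])          # covers only the end
r = pearson(shotgun, pyro454)
```

Here the two technologies cover opposite ends of the ORF, so the correlation is strongly negative and the ORF would fall in the lowest bin of the figure.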
Future Developments
Tuning sequencing for the specific community
Integration of composition-based analysis and BLASTX annotation
4 Comparative Genomics for Antibiotic Resistance
The antibiotic alarm, Nature, 14 March 2013
Rise of
resistance
(inevitable)
Decline of
development
(economics)
45. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Waiting for new drugs. . . How can we cope with it?
Multi-drug
treatments
New therapies
(dosage, duration)
Personalised medicine
(e.g. infecting strain,
patient PK/PD,
patient genotype)
46. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Idea: Drug Switching
Experiments:
Treatments:
Sequential: switch drug
50%-50%: cocktail
Control: no drugs
Protocol:
For each season
bacteria grow
in liquid medium
with drug
1% bacteria transfer
3 replicates
Duration: 96 hours
8 seasons of 12 hours
Drugs: Doxycycline,
Erythromycin
Sequencing: after 24h and 96h
18 datasets (red border)
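The numbers on this slide can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions that the slide does not state explicitly: that the 18 datasets are 3 treatments × 3 replicates × 2 sequencing time points, and that after each 1% transfer the culture regrows to its previous density.

```python
from itertools import product
from math import log2

TREATMENTS = ("sequential", "50%-50% cocktail", "control")
REPLICATES = (1, 2, 3)
SEQUENCED_AT_HOURS = (24, 96)
SEASONS, SEASON_HOURS = 8, 12
TRANSFER_FRACTION = 0.01  # 1% of the bacteria moved to fresh medium

# One dataset per (treatment, replicate, time point): an assumed reading.
datasets = list(product(TREATMENTS, REPLICATES, SEQUENCED_AT_HOURS))
total_hours = SEASONS * SEASON_HOURS

# Doublings needed per season to undo a 100-fold dilution (assumption:
# the population returns to its pre-transfer density every season).
doublings_per_season = log2(1 / TRANSFER_FRACTION)

print(len(datasets), "datasets over", total_hours, "hours")
print("doublings per season:", round(doublings_per_season, 2))
```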
47. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
My Role
Construct annotated reference genome [custom pipeline]
For each replicate, identify:
Structural Variations (SVs)
[Pindel]
Copy Number Variations (CNVs)
[CNVnator]
Single Nucleotide Polymorphisms (SNPs)
[VarScan]
48. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Results CNV: 412kb duplicated region at 96 hours
[Figure: normalised coverage (1000 bins) along the genome (0 to 4.5 Mb) at 96 hours; tracks labelled Control, ERY/DOX and 50−50; legend: Mean +1. Marked genes: ampE, rrnH, paoC, tauA, ybbJ, mdtG, atoB, rrnG, yqhC, rng, rrnD, rrnC, ubiD, rrnA, rhaM, rrnB, rrnE, slt]
Efflux-pump duplications (!)
This region includes the
multidrug efflux pump
AcrRAB-TolC
[Peña-Miller et al, PLOS Biology, 2013]
[Figure: coverage ratio inside/outside the duplicated region at 24 and 96 hours. Panels: Dox/Ery (p < 0.0001), 50%-50% (p < 0.0001), Control (p = 0.303); p-values are for a t-test]
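A minimal sketch of the two quantities shown in the figures: coverage averaged in bins and normalised by its mean, and the ratio of mean coverage inside versus outside a candidate duplicated region. This is not the CNVnator algorithm, only an illustration. The depth values, region coordinates and function names are hypothetical, and the t-test is not reproduced here.

```python
def binned_mean(depth, n_bins):
    """Mean depth in n_bins consecutive bins of (almost) equal width."""
    n = len(depth)
    if not 0 < n_bins <= n:
        raise ValueError("n_bins must be between 1 and len(depth)")
    edges = [i * n // n_bins for i in range(n_bins + 1)]
    return [sum(depth[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]


def normalise(values):
    """Divide every value by the overall mean, so that 1.0 means average."""
    m = sum(values) / len(values)
    return [v / m for v in values]


def inside_outside_ratio(depth, start, end):
    """Mean depth in [start, end) divided by mean depth elsewhere."""
    inside = depth[start:end]
    outside = depth[:start] + depth[end:]
    if not inside or not outside:
        raise ValueError("region must be a proper, non-empty sub-interval")
    return (sum(inside) / len(inside)) / (sum(outside) / len(outside))


# Toy per-base depth (hypothetical): positions 100..119 are duplicated.
depth = [50] * 100 + [100] * 20 + [50] * 80
bins = binned_mean(depth, 10)
print(normalise(bins))
print(inside_outside_ratio(depth, 100, 120))
```

A ratio close to 2 is what a single extra copy of the region would produce in this idealised setting.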
49. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Conclusion
Sequential treatments work well in vitro when cocktails fail
Genomics: antibiotics prevent mutations
Further developments (omics):
Phage role in region duplication
Timing of region duplication
NGS of additional treatments
Transcriptomics
50. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Table of Contents
1 Introduction to Metagenomics
2 Taxonomic-annotation Algorithms
3 Metagenomics to Retrieve a Bacteria
4 Comparative Genomics for Antibiotic Resistance
5 Appendix: construction of a tailored reference
51. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Standard approach: de novo assembly & annotation
Solve the jigsaw puzzle
Functional annotation
Done with software and
manual work
Problems (common)
Errors:
Repetitive regions
misassembled
Wrong order/orientation
Annotation quality
Fragmentation
Quality depends on
time & money
2014: Automated genome
assembly for less than $1,000
[Koren & Phillippy, Curr Opin Microbiol, 2015]
52. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
The alternative: tailored reference
Take the reference genome
of a close relative
Modify it according to
sequencing data
Import annotation from
reference
Pros
Less fragmentation
Higher quality
Better annotation
Cons
You need a close relative
Visually check steps
Ad hoc scripting
Conservative approach
53. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Our case
Sequenced organism:
E. coli K-12 AG100 growing 24h in M9 medium
Reference genome:
E. coli K-12 MG1655 (available online)
Data (preprocessed):
Reads mapping to reference MG1655: 95.84%
Mean coverage depth: 88.19x (based on MG1655)
Read min/max/mean length (bp): 15 / 99 / 72.17
54. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
The pipeline
Read preprocessing (standard)
Mapping to reference MG1655 (standard)
Call Structural Variations (SVs)
Assemble unmapped and mapped data
Make intermediate reference
Check SVs and call SNPs
Functional annotation
55. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Clean & align reads
Reads preprocessing
[fastq-mcf, samtools]
Mapping to reference
[BWA, IGV]
57. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Structural Variations (SVs)
Use Pindel to call SVs
Deletions, Insertions,
Inversions, Translocations
Indels
Break points
Visually checked [IGV]:
Deletions: 5 (total 47kbp)
Indels: 9
Break points: 9
59. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
SVs application and assemble unmapped reads
Take the close relative genome
Break in sequences by applying SVs
Extract reads around removed regions
Extract reads not mapped to reference
Assemble ∪ →
Scaffold ∪
[Python & Bash scripting, Samtools, Velvet,
SSPACE, GapFiller]
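A minimal sketch of the "break in sequences by applying SVs" step, restricted to deletions and to a linear sequence. It is an illustration, not the original pipeline: the function names are hypothetical, coordinates are assumed to be 0-based and half-open, and the circularity of the genome is ignored.

```python
def break_reference(reference, deletions):
    """Split the reference at deleted intervals and return the kept segments.

    `deletions` is a list of (start, end) pairs, 0-based and half-open.
    """
    segments = []
    cursor = 0
    for start, end in sorted(deletions):
        if start < cursor or start >= end or end > len(reference):
            raise ValueError("overlapping or invalid deletion: %r" % ((start, end),))
        if start > cursor:
            segments.append(reference[cursor:start])
        cursor = end
    if cursor < len(reference):
        segments.append(reference[cursor:])
    return segments


def flanking_windows(deletions, flank, ref_len):
    """Intervals around each removed region; reads mapped there would be
    extracted and assembled together with the unmapped reads."""
    return [(max(0, s - flank), min(ref_len, e + flank)) for s, e in deletions]


# Toy reference and one deletion (hypothetical values).
reference = "AAAACCCCGGGGTTTT"
deletions = [(4, 8)]
print(break_reference(reference, deletions))
print(flanking_windows(deletions, 2, len(reference)))
```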
61. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
Making intermediate and tailored references
Making Intermediate reference
Order scaffolds w.r.t. reference [Mauve]
Concatenate the 13 aligned scaffolds
[Bash one-liner]
Making tailored reference
Look for SVs (none should be present)
Call SNPs [VarScan, vcftools]
Annotation
Export annotation from reference [RATT]
Adjust and annotate missing parts [RAST,
manual editing]
Make file ENA-compatible [Python script]
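A minimal sketch of two of the steps above: concatenating scaffolds in reference order to obtain the intermediate reference, then applying called SNPs to obtain the tailored reference. The original work used a Bash one-liner and VarScan/vcftools; this Python version is only an illustration. It assumes that scaffolds are joined without spacer and that SNP positions are 0-based, and all names and sequences are hypothetical.

```python
def concatenate_scaffolds(scaffolds, order):
    """Join scaffold sequences in the given order (no spacer: an assumption)."""
    return "".join(scaffolds[name] for name in order)


def apply_snps(sequence, snps):
    """Apply SNPs given as (position, reference_base, alternative_base).

    Positions are 0-based. The reference base is checked first, so that a
    coordinate mismatch raises an error instead of silently corrupting the
    sequence.
    """
    bases = list(sequence)
    for pos, ref, alt in snps:
        if bases[pos] != ref:
            raise ValueError("expected %s at %d, found %s" % (ref, pos, bases[pos]))
        bases[pos] = alt
    return "".join(bases)


# Toy scaffolds, with the order as an aligner would report it (hypothetical).
scaffolds = {"scaffold_2": "GGTT", "scaffold_1": "AACC"}
intermediate = concatenate_scaffolds(scaffolds, ["scaffold_1", "scaffold_2"])
tailored = apply_snps(intermediate, [(2, "C", "T")])
print(intermediate, tailored)
```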
62. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
In my experience, people do not look at assemblies critically
enough [Nature Methods, 2012]
Clean results need designed protocols, time, and money
Great progress has been made recently,
but sequencing is still not cheap
[Nature Methods, June 2013; Koren & Phillippy, Curr Opin
Microbiol, 2015]
64. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
SNPs ribosomal: mutations in the control
Hypothesis: antibiotics slow down adaptation for optimal growth
in culture
Heightened ribosomal demand due to rapid growth
[Condon et al., J Bacteriol 1995]
% mean variant frequency (replicates, if not all)
50%-50% Dox/Ery Control
operon position relative posn
24h 96h 24h 96h 24h 96h
rrnH 226,521 595 5(2)
227,791 1,865 3(1)
17
rrnG 2,723,624 1,865 3(1)
9
2,724,894 595 8
rrnD 3,421,431 1,865 4(1)
13
3,422,701 595 8
rrnC 3,940,810 595 4(1)
17
rrnA 4,034,586 555 7
rrnB 4,165,708 595 4(1)
8
4,166,978 1,865 10
rrnE 4,207,110 595 3(1)
9
4,208,380 1,865 5(1)
7
SNPs significantly different in frequency (ANOVA)
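A minimal sketch of the one-way ANOVA F statistic that such a comparison relies on, written with the standard library only. The frequencies below are hypothetical and are not taken from the table. Turning F into a p-value needs the F distribution and is left out.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least 2 non-empty groups and n > k")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical variant frequencies (%) of one SNP, three replicates each.
cocktail = [1.0, 2.0, 3.0]
sequential = [2.0, 3.0, 4.0]
control = [7.0, 8.0, 9.0]
f_stat = one_way_anova_f([cocktail, sequential, control])
print("F =", f_stat)
```

A large F means that the variation between treatments is large compared with the variation between replicates of the same treatment.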
Maybe these ribosomal mutations help with α-amino acid
starvation, because. . .
65. Metagenomics Annotation Algorithms Metagenomics to acquire a single genome Antibiotic Resistance Tailored refere
tauA is expressed only under conditions of sulfate or cysteine
(α-amino acid) starvation [Eichhorn et al, J Bacteriol, 2000]
yqhC regulates a scavenger of toxic aldehydes produced by lipid
peroxidation [Jarboe et al, Appl Microbiol Biotechnol, 2011]
% mean variant frequency (replicates, if not all)
50%-50% Dox/Ery Control
gene position 24h 96h 24h 96h 24h 96h annotation
DUPLICATED REGION
tauA 384,897 19(1)
68 taurine transport system
yqhC 3,151,384 45 putative ARAC-type
regulatory protein