This document provides an overview of a tutorial on analyzing microbiome data using 16S rRNA gene sequencing and metagenomics. The morning session covers the basics of 16S analysis including sample collection, PCR amplification of the 16S gene, clustering sequences into OTUs, assigning taxonomy, and calculating alpha and beta diversity. The assumptions and limitations of 16S analysis are also discussed. The afternoon session introduces metagenomics and compares it to 16S analysis. It covers taxonomic and functional profiling from metagenomic data as well as tools like PICRUSt for predicting gene functions. The document concludes by discussing the value of multi-omics approaches that integrate different types of microbiome data.
2. Welcome!
Your Tutorial Team:
Me (16S theory)
Mike Hall (16S practical)
Morgan Langille (metagenomics theory and practical)
Special thanks to:
Will Hsiao (CBW presentation)
4. Overview
Morning session
1. A brief history of molecules and microbes
2. Why 16S?
3. How 16S analysis is usually done
4. Assumptions
5. Hands-on practical
Afternoon session
1. 16S vs Metagenomics
2. Metagenome Taxonomic Composition
3. Metagenome Functional Composition
4. PICRUSt: Functional Inference
5. Hands-on practical
5. Learning objectives
At the end of the 16S tutorial, you should be able to do the following:
1. Run a simple QIIME analysis of a data set
(https://www.dropbox.com/s/kpte51nm17wav9o/stool_data.zip)
2. Interpret analysis results
3. Understand the limitations of the standard 16S analysis pipeline
6. Defining metagenomics
Microbiome: Attributed to Joshua Lederberg by Hooper and Gordon (2001):
“the collective genome of our indigenous microbes (microflora), the idea
being that a comprehensive genetic view of Homo sapiens as a life-form
should include the genes in our microbiome”
Is also used to mean microbiota, the group of microorganisms found in a
particular setting
(usage varies: be careful and precise!)
Metagenome: Handelsman et al. (1998) “…advances in molecular biology
and eukaryotic genomics, which have laid the groundwork for cloning and
functional analysis of the collective genomes of soil microflora, which we
term the metagenome of the soil.”
Does not encompass marker-gene surveys (e.g., 16S)
This report says it does.
7. Micro-what?
Metagenomics is often defined to encompass only Bacteria and Archaea
(and often Archaea are excluded too!)
Other small things to consider:
◦ Viruses / phages
◦ Microbial eukaryotes
◦ Worms (helminths, nematodes, …)
Lukeš et al. (2015) PLoS Pathogens
8. The dawn of metagenomics
3.5 BYA – the Archaean Eon
[Figure: 16S position 349 (-ish): ? → G / A (Archaea / Bacteria)]
11. So much RNA!
Escherichia coli ribosome (PDB 4YBB)
Yarza et al. (2014)
12. Why 16S?
The “universal phylogenetic marker”
(1) Present in all living organisms
(2) Single copy* (no recombination)
(3) Highly conserved + highly variable regions
(4) Huge reference databases
19. Sample collection and DNA extraction
Defined protocols exist, many kits (e.g. PowerSoil®)
Need to consider barriers to DNA recovery and PCR (e.g. humic acids
from soil, bile salts from feces)
Additional mechanical approaches (e.g., mechanical lysis of tissues with
bead beating)
Kits and rogue lab DNA can end up in your sample – need to run
negative controls!!
◦ Example from [year redacted]: shocking finding of bacterial DNA in the
[location redacted]! However, [taxonomic group redacted] was a known
frequent contaminant of DNA extraction kits.
21. Choosing a PCR strategy
Need to consider:
◦ Correct melting temperature (60-65 degrees C for Illumina
protocol)
◦ DNA sequencing read length (influences choice of primers)
◦ Primer specificity!
◦ Comparability with previous studies?
[Good luck with that]
[but that’s what the Earth Microbiome Project protocol
http://www.earthmicrobiome.org/emp-standard-protocols/16s/
is meant to achieve]
22. Which variable regions to target?
V1-V3 favours Prevotella, Fusobacterium, Streptococcus, Granulicatella, Bacteroides,
Porphyromonas and Treponema
V4-V6 favours Streptococcus, Treponema, Prevotella, Eubacterium, Porphyromonas,
Campylobacter and Enterococcus.
◦ failed to detect Fusobacterium
V7-V9 favours Veillonella, Streptococcus, Eubacterium, Enterococcus, Treponema,
Catonella and Selenomonas.
◦ failed to detect Selenomonas, TM7 and Mycoplasma
23. At least there’s no shortage of options…
Detailed in silico evaluation of primers, experimental evaluation of two sets
Heavily biased recovery of Bacteria, Archaea, and missing groups depending on primer
choice.
“Out of the 175 primers and 512 primer pairs checked, only 10 can be recommended as
broad-range primers.”
25. Analysis
(examples mostly from QIIME)
1. Quality Control
◦ Error checking
2. Sample diversity
◦ Taxonomy agnostic
◦ Taxonomy aware
3. Similarity among samples
4. Associations with metadata/groups (ANOSIM, MRPP)
5. Machine-learning classification
6. Functional prediction
26. QIIME vs Mothur
QIIME:
◦ A Python interface to glue together many programs
◦ Wrappers for existing programs
◦ Large number of dependencies / VM available
◦ More scalable
◦ Steeper learning curve but more flexible workflow if you can write your own scripts
◦ http://www.ncbi.nlm.nih.gov/pubmed/24060131
Mothur:
◦ Single program with minimal external dependency
◦ Reimplementation of popular algorithms
◦ Easy to install and set up; works best on a single multi-core server with lots of memory
◦ Less scalable
◦ Easy to learn but workflow works best with built-in tools
◦ http://www.mothur.org/wiki/MiSeq_SOP
Will Hsiao
27. “Analysis” #1
Quality Control
Quality score filtering:
◦ Minimal length of consecutive high-quality bases (as % of total read length)
◦ Maximal number of consecutive low-quality bases
◦ Maximal number of ambiguous bases (N’s)
◦ Minimum Phred quality score
Other quality filtering tools available
◦ Cutadapt (https://github.com/marcelm/cutadapt)
◦ Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
◦ Sickle (https://github.com/najoshi/sickle)
Chimera checking:
◦ UCHIME
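The quality-score criteria above can be sketched in a few lines of Python. This is only an illustration, not what QIIME or any of the listed tools actually runs: the function names, the Phred+33 encoding, and every threshold value are assumptions chosen for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (an assumption; older files may use +64).
    """
    return [ord(ch) - offset for ch in quality_string]


def passes_quality_filter(seq, qual, min_phred=20, min_good_fraction=0.75,
                          max_bad_run=3, max_n=0):
    """Apply the four criteria from the slide to one read.

    All thresholds are hypothetical defaults for illustration.
    """
    scores = phred_scores(qual)
    if not seq or len(seq) != len(scores):
        return False
    # Maximal number of ambiguous bases (N's)
    if seq.upper().count("N") > max_n:
        return False
    longest_good = longest_bad = good = bad = 0
    for q in scores:
        if q >= min_phred:  # minimum Phred quality score
            good += 1
            bad = 0
        else:
            bad += 1
            good = 0
        longest_good = max(longest_good, good)
        longest_bad = max(longest_bad, bad)
    # Maximal number of consecutive low-quality bases
    if longest_bad > max_bad_run:
        return False
    # Minimal run of consecutive high-quality bases, as a fraction of read length
    return longest_good / len(seq) >= min_good_fraction
```

A read would then be kept only if `passes_quality_filter(sequence, quality_string)` returns True; chimera checking is a separate step that this sketch does not cover.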
29. Analysis #2
Within-sample (“alpha”) diversity
To describe the diversity of a sample, you need to know what you are
counting!
Individual sequences?
◦ Most precise, but vulnerable to sequencing error effects – inflation of
diversity
Clusters of sequences?
◦ Operational taxonomic units (OTUs) – 97% sequence identity as the
“species” level of similarity
Taxonomic groups?
◦ It’s always reassuring to put names on things, but taxonomic labels can be
extremely misleading
30. OTU clustering
Choose a % identity threshold (e.g., 97%)
Cluster centroids in some order (e.g., length, abundance) – these are reference sequences
Continue procedure until all sequences are clustered into OTUs (singletons may be excluded)
Calculate distances between sequences (e.g., 6%)
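The greedy, centroid-based procedure on this slide can be sketched as follows. It is a simplified illustration rather than the algorithm of any particular tool: it assumes the sequences are already aligned to equal length, uses abundance as the ordering, and the function names are made up for the example.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences.

    Assumes equal-length, already aligned input; real tools compute an
    alignment first.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otu_clustering(seq_counts, threshold=0.97):
    """Cluster unique sequences into OTUs around greedy centroids.

    seq_counts maps each unique sequence to its abundance.
    Returns a list of (centroid, members) pairs.
    """
    # Process the most abundant sequences first; ties broken alphabetically
    ordered = sorted(seq_counts, key=lambda s: (-seq_counts[s], s))
    otus = []
    for seq in ordered:
        for centroid, members in otus:
            if identity(seq, centroid) >= threshold:
                members.append(seq)  # joins the first centroid that is close enough
                break
        else:
            otus.append((seq, [seq]))  # no match: seq becomes a new centroid
    return otus
```

Because each sequence joins the first centroid within the threshold, the result depends on the processing order, which is one reason different clustering tools can disagree on the same data.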
31. What’s in a name?
[Figure: groups labelled Bacteroides, Parabacteroides, Ruminococcus and Akkermansia, plus four unnamed groups (???)]
32. Taxonomic assignment
Many choices:
BLAST – assign taxonomic label of closest match (simple, possibly too simple)
Phylogenetic placement – e.g. Pplacer (Matsen et al., BMC Bioinformatics
2010)
Machine-learning classification, in particular Naïve Bayes e.g. RDP Classifier,
Wang et al. (2007) BMC Bioinformatics
33. Example RDP Classifier output
GD6JEAT01AYGPE Root rootrank 1.0 Bacteria domain 1.0
"Planctomycetes" phylum 1.0 "Planctomycetacia" class 1.0
Planctomycetales order 1.0 Planctomycetaceae family 1.0
Schlesneria genus 0.96
GD6JEAT01BEUG6 Root rootrank 1.0 Bacteria domain 1.0
Firmicutes phylum 0.32 Clostridia class 0.26
Clostridiales order 0.23 Ruminococcaceae family 0.22
Anaerotruncus genus 0.19
Includes bootstrap support
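One way to use the bootstrap support is to truncate each lineage at the first rank that falls below a chosen cutoff. The sketch below assumes each record is a single whitespace-separated line of (name, rank, confidence) triples, as in the example output, and that taxon names contain no spaces; the function names and the 0.8 cutoff are illustrative assumptions.

```python
def parse_rdp_line(line):
    """Split one classifier record into (sequence id, lineage).

    Assumes whitespace-separated (name, rank, confidence) triples and
    taxon names without internal spaces.
    """
    tokens = line.split()
    seq_id, rest = tokens[0], tokens[1:]
    if len(rest) % 3 != 0:
        raise ValueError("expected name/rank/confidence triples")
    lineage = []
    for i in range(0, len(rest), 3):
        name, rank, conf = rest[i], rest[i + 1], float(rest[i + 2])
        lineage.append((name.strip('"'), rank, conf))
    return seq_id, lineage


def confident_lineage(lineage, min_confidence=0.8):
    """Keep ranks from the root down until support drops below the cutoff."""
    kept = []
    for name, rank, conf in lineage:
        if conf < min_confidence:
            break
        kept.append((rank, name))
    return kept
```

Applied to the two records above with this cutoff, the first read keeps its full lineage down to genus Schlesneria, while the second is resolved no further than domain Bacteria.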
34. Calculating alpha diversity
OTU counts – richness only
Simpson index – probability of sampling two individuals of the same type
Phylogenetic diversity – sum of branch lengths
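A minimal sketch of the three measures, assuming OTU counts are held in a dict and the tree is given as a child-to-(parent, branch length) mapping. Both data structures and all names are assumptions for illustration; the Simpson index follows the definition on the slide (sampling with replacement), and the phylogenetic diversity here includes the path to the root.

```python
def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def simpson_index(counts):
    """Probability that two individuals drawn with replacement share a type."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no observations")
    return sum((c / total) ** 2 for c in counts.values())


def phylogenetic_diversity(counts, tree):
    """Sum of branch lengths connecting the observed OTUs to the root.

    tree maps node -> (parent, branch_length); the root has parent None.
    """
    visited = set()
    total = 0.0
    for tip, count in counts.items():
        if count <= 0:
            continue
        node = tip
        # Walk towards the root, counting each branch only once
        while node is not None and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total
```

Note that a higher Simpson index under this definition means lower diversity, so it is often reported as one minus this value.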
36. Analysis #3
Among-sample (“beta”) diversity
1. Perform pairwise comparisons between all samples to build a
dissimilarity matrix
2. Summarize the matrix based on major patterns of covariance
or hierarchical similarity
37. Analysis #3
Among-sample (“beta”) diversity
Given a pair of samples (described as e.g. OTU abundance), calculate
their dissimilarity
Beta-diversity measures can be:
◦ non-phylogenetic or phylogenetic
◦ weighted or unweighted
There are a lot of measures!
-Bray-Curtis (weighted, non-phylogenetic)
-Jaccard (unweighted, non-phylogenetic)
-Weighted UniFrac (weighted, phylogenetic)
-…
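The two non-phylogenetic measures listed above can be written out directly. This is a minimal sketch assuming each sample is a dict of OTU counts; the function names are made up for the example, and the phylogenetic measures (which also need a tree) are not covered.

```python
def bray_curtis(a, b):
    """Weighted, non-phylogenetic dissimilarity between two count dicts."""
    otus = set(a) | set(b)
    numerator = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    denominator = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


def jaccard(a, b):
    """Unweighted (presence/absence), non-phylogenetic dissimilarity."""
    present_a = {o for o, c in a.items() if c > 0}
    present_b = {o for o, c in b.items() if c > 0}
    union = present_a | present_b
    if not union:
        raise ValueError("both samples are empty")
    return 1 - len(present_a & present_b) / len(union)


def dissimilarity_matrix(samples, measure):
    """Pairwise comparison of all samples; returns (names, square matrix)."""
    names = sorted(samples)
    matrix = [[0.0 if x == y else measure(samples[x], samples[y])
               for y in names] for x in names]
    return names, matrix
```

For two samples that share most OTUs but differ strongly in abundance, Bray-Curtis will be large while Jaccard stays small, which is the weighted versus unweighted distinction made above.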
38. Analysis #3
Among-sample (“beta”) diversity
How similar are the results of different
measures?
CORRELATIONS between calculated
values
Parks and Beiko (2013): ISME J
39. Analysis #3
Among-sample (“beta”) diversity
What to do with a dissimilarity matrix?
Yatsunenko et al. (2012) Nature Parks and Beiko (2012) Mol Biol Evol
Ordination
Clustering
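As an illustration of the clustering option, here is a small average-linkage (UPGMA-style) agglomeration that works directly on a dissimilarity matrix. It is a sketch with made-up function and variable names, written for readability rather than speed; ordination (e.g., principal coordinates analysis) needs an eigendecomposition and is not shown.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    names: sample labels; dist: symmetric matrix as a list of lists.
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = {i: [i] for i in range(len(names))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                ci, cj = clusters[keys[x]], clusters[keys[y]]
                # Mean of all pairwise distances between the two clusters
                d = sum(dist[i][j] for i in ci for j in cj) / (len(ci) * len(cj))
                if best is None or d < best[0]:
                    best = (d, keys[x], keys[y])
        d, kx, ky = best
        merges.append((sorted(names[i] for i in clusters[kx]),
                       sorted(names[i] for i in clusters[ky]), d))
        clusters[kx] = clusters[kx] + clusters.pop(ky)
    return merges
```

The merge list is enough to draw a dendrogram: each entry records which two groups were joined and at what average dissimilarity.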
40. Analysis #3
Among-sample (“beta”) diversity
Different beta-diversity measures can
yield dramatically different clusters!
Parks and Beiko (2013): ISME J
41. Analysis #4
Associations with metadata
PERMANOVA: Permutational multivariate analysis of variance
ANOSIM: Rank-based analysis of similarity
Mantel test: Comparison of between-group vs within-group distances
Good review: Anderson and Walsh (2013) Ecological Monographs
Example:
Weighted UniFrac distance: root compartment
explains 46.62% of variance (PERMANOVA p<0.001)
Unweighted UniFrac: root compartment explains only
18.07% of variance (PERMANOVA p<0.001); soil type
is more important
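The "explains X% of variance" and p-value figures in the example come from a PERMANOVA-style calculation. A minimal one-factor sketch is shown below, assuming a square dissimilarity matrix and one group label per sample; the function names, the default of 999 permutations, and the fixed seed are assumptions, and a real analysis would use an established implementation.

```python
import random


def pseudo_f(dist, groups):
    """Pseudo-F statistic and R^2 for a one-factor design.

    dist: symmetric matrix (list of lists); groups: one label per sample.
    """
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    if a < 2 or n <= a:
        raise ValueError("need at least two groups and more samples than groups")
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in labels:
        members = [i for i in range(n) if groups[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for x, i in enumerate(members)
                         for j in members[x + 1:]) / len(members)
    ss_among = ss_total - ss_within
    f_stat = (ss_among / (a - 1)) / (ss_within / (n - a))
    return f_stat, ss_among / ss_total


def permanova(dist, groups, permutations=999, seed=0):
    """Permutation test: shuffle group labels and recompute pseudo-F."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        f_perm, _r2 = pseudo_f(dist, shuffled)
        if f_perm >= f_obs - 1e-12:  # small tolerance for float round-off
            hits += 1
    return f_obs, r2, (hits + 1) / (permutations + 1)
```

The returned R^2 is the fraction of variance explained by the grouping. With very few samples the smallest achievable p-value is limited by the number of distinct label arrangements, so a strong pseudo-F does not guarantee a small p.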
42. Analysis #5
Machine-learning classification
Identify aspects of community structure that are predictive of sample
attributes
Advantages of machine-learning approaches:
◦ Non-linear combinations of variables
◦ Data transformations
◦ Can accommodate many different representations of the data
Disadvantages:
◦ Complex, may “overfit”
◦ Can be time consuming
◦ Obfuscation of predictive rules
43. Random forests
(supervised_learning.py)
“…there are only weak and, for the most part, non-significant associations of
particular taxa or overall diversity with the obese human gut that hold true across
different studies. However, using supervised learning with receiver operator
curves to maximize sensitivity and specificity, one can categorize subjects
according to lean and obese states with in some cases considerable accuracy…”
44. Tree-based classifications
Nested clade analysis
and feature selection
Classification of plaque samples
using support vector machines
Ning and Beiko (2015): Microbiome
47. Do not assume that
#1: 16S is an effective proxy for microbial diversity.
#2: All 16S studies are created equal, with results that are comparable.
#3: Rarefaction is a good idea.
#4: 16S OTUs describe ecologically cohesive units (“species”?).
#5: The 16S tree is the “Tree of Life”.
48. Assumption #1
16S is an effective proxy for microbial diversity.
rrnDB: Stoddard et al. NAR (2014)
Estimating copy number: Kembel et al. (2012) and PICRUSt (coming up later)
Variation: Coenye and Vandamme (2003)
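To see why 16S copy number matters, the sketch below divides read counts by an assumed number of 16S copies per genome before computing relative abundance. The taxon names and copy numbers in the usage note are hypothetical; in practice copy numbers would come from a resource such as rrnDB or be estimated, and this is not the method of any specific tool.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Relative abundance after dividing reads by 16S copies per genome.

    read_counts: taxon -> 16S read count
    copy_numbers: taxon -> assumed 16S gene copies per genome
    """
    corrected = {taxon: read_counts[taxon] / copy_numbers[taxon]
                 for taxon in read_counts}
    total = sum(corrected.values())
    if total == 0:
        raise ValueError("no reads to normalise")
    return {taxon: value / total for taxon, value in corrected.items()}
```

For example, with hypothetical counts `{"taxon_A": 700, "taxon_B": 100}` and copy numbers `{"taxon_A": 7, "taxon_B": 1}`, the raw reads suggest taxon_A dominates, but the corrected abundances are equal.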
49. Assumption #1
16S is an effective proxy for microbial
diversity.
Alternative marker genes: cpn60, rpoB, …
Smaller reference databases!
Protein-coding genes!
50. Assumption #2
All 16S studies are created equal.
Effects of sequencing platform, V region, amplicon vs metagenomics
Tremblay et al. (2015) Front Microbiol
51. Assumption #3
Rarefaction is a good idea.
Example of statistics before and after rarefaction:
Loss of statistical power
Random subsampling can increase false-positive differences
Arbitrary minimum library size chosen for downsampling
Alternatives: e.g., negative binomial fitting (DESeq2)
McMurdie and Holmes (2014) PLoS Comp Biol
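For reference, rarefaction itself is just random subsampling without replacement down to a common depth. The sketch below assumes each sample is a dict of OTU counts; the function names and the choice of the smallest library as the depth are illustrative assumptions.

```python
import random


def rarefy(counts, depth, seed=None):
    """Subsample one sample's reads without replacement to a fixed depth."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds library size")
    # Expand counts into one entry per read, then draw `depth` of them
    pool = [otu for otu, c in counts.items() for _ in range(c)]
    rng = random.Random(seed)
    result = {otu: 0 for otu in counts}
    for otu in rng.sample(pool, depth):
        result[otu] += 1
    return result


def rarefy_table(samples, seed=0):
    """Rarefy every sample to the smallest library size in the table."""
    depth = min(sum(c.values()) for c in samples.values())
    return {name: rarefy(c, depth, seed) for name, c in samples.items()}
```

Every read above the chosen depth is discarded, which is where the loss of statistical power comes from, and rare OTUs in deeply sequenced samples can disappear by chance.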
52. Assumption #4
16S OTUs describe ecologically cohesive units.
[Figure: distribution of sequence similarity and branch lengths (dashed line = OTU threshold)]
Nguyen et al. (2016) npj Biofilms and Microbiomes
53. Assumption #4
16S OTUs describe ecologically cohesive units.
Hall et al., in preparation
Same OTU, different temporal patterns
54. Assumption #4
16S OTUs describe ecologically cohesive units.
Many alternatives exist, including Swarm: Mahé et al. (2015) PeerJ
55. Assumption #5
The 16S tree is the “Tree of Life”.
16S is limited for several reasons:
Limited resolving power
Subject to compositional bias
Subject to recombination and lateral transfer
Models typically applied to protein-coding genes do not make sense for noncoding RNA
57. Multi-omics??
16S can profile the biodiversity of a microbial sample…
But we need the metagenome to shine a light on function…
The metatranscriptome tells us what is expressed under specific
conditions…
And the metaproteome can quantify the relative abundance of different
enzymes…
While the metametabolome focuses on the products of metabolism.
What do we really need?