CCBC tutorial beiko

Microbiome Analysis
16S AND METAGENOMICS

Welcome!
Your Tutorial Team:
Me (16S theory)
Mike Hall (16S practical)
Morgan Langille (metagenomics theory and practical)
Special thanks to:
Will Hsiao (CBW presentation)

Today’s presentation
CBW “Analysis of metagenomic data”
http://bioinformatics.ca/workshops/2015/analysis-metagenomic-data-2015

Overview
Morning session
1. A brief history of molecules and microbes
2. Why 16S?
3. How 16S analysis is usually done
4. Assumptions
5. Hands-on practical
Afternoon session
1. 16S vs Metagenomics
2. Metagenome Taxonomic Composition
3. Metagenome Functional Composition
4. PICRUSt: Functional Inference
5. Hands-on practical

Learning objectives
At the end of the 16S tutorial, you should be able to do the following:
1. Run a simple QIIME analysis of a data set
(https://www.dropbox.com/s/kpte51nm17wav9o/stool_data.zip)
2. Interpret analysis results
3. Understand the limitations of the standard 16S analysis pipeline

Defining metagenomics
Microbiome: Attributed to Joshua Lederberg by Hooper and Gordon (2001):
“the collective genome of our indigenous microbes (microflora), the idea
being that a comprehensive genetic view of Homo sapiens as a life-form
should include the genes in our microbiome”
Is also used to mean microbiota, the group of microorganisms found in a
particular setting
(usage varies: be careful and precise!)
Metagenome: Handelsman et al. (1998) “…advances in molecular biology
and eukaryotic genomics, which have laid the groundwork for cloning and
functional analysis of the collective genomes of soil microflora, which we
term the metagenome of the soil.”
By this definition, metagenomics does not encompass marker-gene surveys (e.g., 16S),
although the report shown on this slide says it does.

Micro-what?
Metagenomics is often defined to encompass only Bacteria and Archaea
(and often Archaea are excluded too!)
Other small things to consider:
◦ Viruses / phages
◦ Microbial eukaryotes
◦ Worms (helminths, nematodes, …)
Lukeš et al. (2015) PLoS Pathogens

The dawn of metagenomics
3.5 BYA – the Archaean Eon
16S position 349 (-ish): ? → G (Archaea), A (Bacteria)

The 16S ribosomal RNA gene
THE FIRST WORD IN MICROBIAL BIODIVERSITY

Yarza et al. (2014)
Escherichia coli ribosome (PDB 4YBB)
So much RNA!

Why 16S?
The “universal phylogenetic marker”
(1) Present in all living organisms
(2) Single copy* (no recombination)
(3) Highly conserved + highly variable regions
(4) Huge reference databases

Milestones
1990: “proposal for the domains Archaea, Bacteria, and Eucarya”

Milestones
Nature (1990)
2002: “…as much as 50% of the total surface microbial community…”

Milestones
PNAS (2006)
Many critical papers followed (error filtering, clustering approaches, …)

Milestones
Huttenhower, Gevers et al. (2012)
+ 681 metagenomic samples

16S analysis
HOW IT’S DONE

Your basic workflow
Sample collection → DNA extraction → Amplification → Analysis

Sample collection and DNA extraction
Defined protocols exist, many kits (e.g. PowerSoil®)
Need to consider barriers to DNA recovery and PCR (e.g. humic acids
from soil, bile salts from feces)
Additional mechanical approaches (e.g., mechanical lysis of tissues with
bead beating)
Kits and rogue lab DNA can end up in your sample – need to run
negative controls!!
◦ Example from [year redacted]: shocking finding of bacterial DNA in the
[location redacted]! However, [taxonomic group redacted] was a known
frequent contaminant of DNA extraction kits.

Size fractionation
http://www.jove.com/video/52685/automated-gel-size-selection-to-improve-quality-next-generation

Choosing a PCR strategy
Need to consider:
◦ Correct melting temperature (60–65 °C for the Illumina protocol)
◦ DNA sequencing read length (influences choice of primers)
◦ Primer specificity!
◦ Comparability with previous studies?
[Good luck with that]
[but that’s what the Earth Microbiome Project protocol
http://www.earthmicrobiome.org/emp-standard-protocols/16s/
is meant to achieve]

Which variable regions to target?
V1-V3 favours Prevotella, Fusobacterium, Streptococcus, Granulicatella, Bacteroides,
Porphyromonas and Treponema
V4-V6 favours Streptococcus, Treponema, Prevotella, Eubacterium, Porphyromonas,
Campylobacter and Enterococcus.
◦ failed to detect Fusobacterium
V7-V9 favours Veillonella, Streptococcus, Eubacterium, Enterococcus, Treponema,
Catonella and Selenomonas.
◦ failed to detect Selenomonas, TM7 and Mycoplasma

At least there’s no shortage of options…
Detailed in silico evaluation of primers, experimental evaluation of two sets
Heavily biased recovery of Bacteria, Archaea, and missing groups depending on primer
choice.
“Out of the 175 primers and 512 primer pairs checked, only 10 can be recommended as
broad-range primers.”

Amplification
Example: Illumina protocol

Analysis
(examples mostly from QIIME)
1. Quality Control
◦ Error checking
2. Sample diversity
◦ Taxonomy agnostic
◦ Taxonomy aware
3. Similarity among samples
4. Associations with metadata/groups (ANOSIM, MRPP)
5. Machine-learning classification
6. Functional prediction

QIIME vs Mothur
QIIME:
◦ A Python interface to glue together many programs
◦ Wrappers for existing programs
◦ Large number of dependencies / VM available
◦ More scalable
◦ Steeper learning curve, but a more flexible workflow if you can write your own scripts
◦ http://www.ncbi.nlm.nih.gov/pubmed/24060131
Mothur:
◦ Single program with minimal external dependencies
◦ Reimplementation of popular algorithms
◦ Easy to install and set up; works best on a single multi-core server with lots of memory
◦ Less scalable
◦ Easy to learn, but the workflow works best with built-in tools
◦ http://www.mothur.org/wiki/MiSeq_SOP
Will Hsiao

“Analysis” #1
Quality Control
Quality score filtering:
◦ Minimal length of consecutive high-quality bases (as % of total read length)
◦ Maximal number of consecutive low-quality bases
◦ Maximal number of ambiguous bases (N’s)
◦ Minimum Phred quality score
Other quality filtering tools available
◦ Cutadapt (https://github.com/marcelm/cutadapt)
◦ Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
◦ Sickle (https://github.com/najoshi/sickle)
Chimera checking:
◦ UCHIME
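
The quality-score filtering criteria listed above can be sketched in a few lines of Python. This is a minimal illustration, not QIIME's implementation: the function names, the Phred+33 encoding, and every threshold value are assumptions chosen for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (an assumption; check your data).
    """
    return [ord(ch) - offset for ch in quality_string]


def passes_filter(seq, qual, min_phred=20, min_good_fraction=0.75,
                  max_bad_run=3, max_ambiguous=0):
    """Apply the four criteria from the slide to a single read.

    All thresholds are illustrative, not recommended defaults.
    """
    scores = phred_scores(qual)
    if len(seq) == 0 or len(seq) != len(scores):
        return False
    # Maximal number of ambiguous bases (N's)
    if seq.upper().count("N") > max_ambiguous:
        return False
    longest_good = good = 0
    longest_bad = bad = 0
    for q in scores:
        if q >= min_phred:      # base meets the minimum Phred score
            good += 1
            bad = 0
        else:
            bad += 1
            good = 0
        longest_good = max(longest_good, good)
        longest_bad = max(longest_bad, bad)
    # Maximal number of consecutive low-quality bases
    if longest_bad > max_bad_run:
        return False
    # Minimal run of consecutive high-quality bases, as a fraction of read length
    return longest_good / len(seq) >= min_good_fraction
```

A read is kept only if it passes all four checks; in practice the thresholds are tuned to the sequencing platform and read length.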

Sequence quality summary using FASTQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Analysis #2
Within-sample (“alpha”) diversity
To describe the diversity of a sample, you need to know what you are
counting!
Individual sequences?
◦ Most precise, but vulnerable to sequencing error effects – inflation of
diversity
Clusters of sequences?
◦ Operational taxonomic units (OTUs) – 97% sequence identity as the
“species” level of similarity
Taxonomic groups?
◦ It’s always reassuring to put names on things, but taxonomic labels can be
extremely misleading

OTU clustering
1. Calculate distances between sequences
2. Choose a % identity threshold (e.g., 97%)
3. Cluster centroids in some order (e.g., length, abundance) – these are reference sequences
4. Continue procedure until all sequences are clustered (singletons may be excluded)
[Figure labels: 97%, 6%, OTU]
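
The greedy centroid procedure on this slide can be sketched as follows. This is a simplified illustration under loud assumptions: sequences are taken to be pre-aligned and of equal length, identity is a plain per-position match fraction, and the function names are hypothetical. Real clustering tools use alignment heuristics rather than this naive comparison.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def greedy_otu_clustering(reads, threshold=0.97):
    """Greedy centroid clustering; returns {centroid sequence: read count}."""
    abundance = Counter(reads)
    # Process unique sequences from most to least abundant
    # (ties broken alphabetically so the result is deterministic).
    ordered = sorted(abundance, key=lambda s: (-abundance[s], s))
    otus = {}
    for seq in ordered:
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid] += abundance[seq]   # join the first matching OTU
                break
        else:
            otus[seq] = abundance[seq]             # no match: seq founds a new OTU
    return otus
```

Because each sequence joins the first centroid it matches, the outcome depends on processing order, which is one reason different clustering approaches can yield different OTUs from the same reads.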

What’s in a name?
Bacteroides
Parabacteroides
Ruminococcus
???
???
???
???
Akkermansia

Taxonomic assignment
Many choices:
BLAST – assign taxonomic label of closest match (simple, possibly too simple)
Phylogenetic placement – e.g. pplacer (Matsen et al., BMC Bioinformatics 2010)
Machine-learning classification, in particular naïve Bayes, e.g. RDP Classifier,
Wang et al. (2007) Appl Environ Microbiol

Example RDP Classifier output
GD6JEAT01AYGPE Root rootrank 1.0 Bacteria domain 1.0
"Planctomycetes" phylum 1.0 "Planctomycetacia" class 1.0
Planctomycetales order 1.0 Planctomycetaceae family 1.0
Schlesneria genus 0.96
GD6JEAT01BEUG6 Root rootrank 1.0 Bacteria domain 1.0
Firmicutes phylum 0.32 Clostridia class 0.26
Clostridiales order 0.23 Ruminococcaceae family 0.22
Anaerotruncus genus 0.19
Includes bootstrap support
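
Output like the records above can be parsed so that a lineage is reported only as deep as its bootstrap support allows. The sketch below assumes each record is one tab-separated line holding the sequence ID followed by (name, rank, confidence) triples; the function names and the 0.8 cutoff are illustrative assumptions.

```python
def parse_rdp_line(line):
    """Split one record into (sequence_id, [(name, rank, confidence), ...]).

    Assumes tab-separated fields; empty fields are ignored and
    surrounding double quotes on taxon names are stripped.
    """
    fields = [f.strip().strip('"') for f in line.rstrip("\n").split("\t")]
    fields = [f for f in fields if f]
    seq_id, rest = fields[0], fields[1:]
    if len(rest) % 3 != 0:
        raise ValueError("expected (name, rank, confidence) triples")
    triples = [(rest[i], rest[i + 1], float(rest[i + 2]))
               for i in range(0, len(rest), 3)]
    return seq_id, triples


def confident_lineage(assignments, min_confidence=0.8):
    """Keep ranks from the root down until support falls below the cutoff."""
    kept = []
    for name, rank, confidence in assignments:
        if confidence < min_confidence:
            break
        kept.append((rank, name))
    return kept
```

Applied to the second record above, a cutoff of 0.8 would stop at the domain level, because support for Firmicutes at the phylum rank is only 0.32.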

Calculating alpha diversity
OTU counts – richness only
Simpson index – probability of sampling two individuals of the same type
Phylogenetic diversity – sum of branch lengths
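
The three measures above can be sketched directly from their definitions. This is a minimal illustration: the function names and the tree representation (a dictionary mapping each node to its parent and branch length) are assumptions made for the example, and the phylogenetic diversity shown here includes the path back to the root.

```python
def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)


def simpson_index(counts):
    """Probability that two reads drawn without replacement share an OTU."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two reads")
    return sum(c * (c - 1) for c in counts) / (n * (n - 1))


def phylogenetic_diversity(parent, present_tips):
    """Sum of branch lengths connecting the observed tips to the root.

    `parent` maps node -> (parent_node, branch_length); the root has no entry.
    """
    visited = set()
    total = 0.0
    for tip in present_tips:
        node = tip
        # Walk towards the root, counting each branch only once.
        while node in parent and node not in visited:
            visited.add(node)
            up, length = parent[node]
            total += length
            node = up
    return total
```

Note that this Simpson index measures dominance: higher values mean lower diversity, so it is often reported as one minus this value.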

Example: human body-site diversity
Huttenhower, Gevers et al. (2012)

Analysis #3
Among-sample (“beta”) diversity
1. Perform pairwise comparisons between all samples to build a
dissimilarity matrix
2. Summarize the matrix based on major patterns of covariance
or hierarchical similarity

Analysis #3
Given a pair of samples (described as e.g. OTU abundance), calculate
their dissimilarity
Beta-diversity measures can be:
◦ non-phylogenetic or phylogenetic
◦ weighted or unweighted
There are a lot of measures!
-Bray-Curtis (weighted, non-phylogenetic)
-Jaccard (unweighted, non-phylogenetic)
-Weighted UniFrac (weighted, phylogenetic)
-…
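
Two of the non-phylogenetic measures listed above can be computed directly from a pair of OTU abundance vectors. This is a minimal sketch; the function names are assumptions, and both vectors are assumed to list the same OTUs in the same order.

```python
def bray_curtis(x, y):
    """Weighted, non-phylogenetic dissimilarity between two abundance vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must cover the same OTUs")
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / total


def jaccard(x, y):
    """Unweighted, non-phylogenetic dissimilarity based on presence/absence."""
    if len(x) != len(y):
        raise ValueError("vectors must cover the same OTUs")
    present_x = {i for i, a in enumerate(x) if a > 0}
    present_y = {i for i, b in enumerate(y) if b > 0}
    union = present_x | present_y
    if not union:
        raise ValueError("both samples are empty")
    return 1 - len(present_x & present_y) / len(union)


def dissimilarity_matrix(samples, measure):
    """Pairwise comparisons between all samples."""
    return [[measure(a, b) for b in samples] for a in samples]
```

Bray-Curtis responds to changes in abundance, while Jaccard only registers whether an OTU is present, so the two can rank the same pairs of samples differently.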

Analysis #3
How similar are the results of different measures?
CORRELATIONS between calculated values
Parks and Beiko (2013): ISME J

Analysis #3
What to do with a dissimilarity matrix?
Ordination
Clustering
Yatsunenko et al. (2012) Nature; Parks and Beiko (2012) Mol Biol Evol

Analysis #3
Different beta-diversity measures can yield dramatically different clusters!
Parks and Beiko (2013): ISME J

Analysis #4
Associations with metadata
PERMANOVA: Permutational multivariate analysis of variance
ANOSIM: Rank-based analysis of similarity (compares between-group vs within-group distances)
Mantel test: Correlation between two distance matrices
Good review: Anderson and Walsh (2013) Ecological Monographs
Example:
Weighted UniFrac distance: root compartment
explains 46.62% of variance (PERMANOVA p<0.001)
Unweighted UniFrac: root compartment explains only
18.07% of variance (PERMANOVA p<0.001); soil type
is more important
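
ANOSIM, the rank-based analysis of similarity listed above, compares the ranks of between-group and within-group dissimilarities and assesses significance by permuting the group labels. The sketch below is a minimal illustration with hypothetical function names, not the QIIME implementation.

```python
import random
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    """R = (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    if not within or not between:
        raise ValueError("need both within-group and between-group pairs")
    return (mean(between) - mean(within)) / (n * (n - 1) / 4)


def anosim(dist, groups, permutations=999, seed=0):
    """Return (observed R, permutation p-value)."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)          # break any real group structure
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

R close to 1 indicates that samples within a group are more similar to each other than to samples from other groups; R near 0 indicates no group structure.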

Analysis #5
Machine-learning classification
Identify aspects of community structure that are predictive of sample
attributes
Advantages of machine-learning approaches:
◦ Non-linear combinations of variables
◦ Data transformations
◦ Can accommodate many different representations of the data
Disadvantages:
◦ Complex, may “overfit”
◦ Can be time consuming
◦ Obfuscation of predictive rules

Random forests
(supervised_learning.py)
“…there are only weak and, for the most part, non-significant associations of
particular taxa or overall diversity with the obese human gut that hold true across
different studies. However, using supervised learning with receiver operator
curves to maximize sensitivity and specificity, one can categorize subjects
according to lean and obese states with in some cases considerable accuracy…”

Tree-based classifications
Nested clade analysis
and feature selection
Classification of plaque samples
using support vector machines
Ning and Beiko (2015): Microbiome

Analysis #6
Functional prediction
PICRUSt: Langille et al. (2013) Nat Biotechnol
Morgan can tell you about this…

Assumptions
THAT ARE OFTEN FALSE

Do not assume that
#1: 16S is an effective proxy for microbial diversity.
#2: All 16S studies are created equal, with results that are comparable.
#3: Rarefaction is a good idea.
#4: 16S OTUs describe ecologically cohesive units (“species”?).
#5: The 16S tree is the “Tree of Life”.

Assumption #1
16S is an effective proxy for microbial diversity.
rrnDB: Stoddard et al. NAR (2014)
Estimating copy number: Kembel et al. (2012) and PICRUSt (coming up later)
Variation: Coenye and Vandamme (2003)
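
Because the number of 16S copies differs among genomes, raw read counts over-represent taxa with many copies. A simple correction divides each count by an estimated copy number before computing relative abundance. The sketch below is illustrative only: the function name is hypothetical and the copy numbers in the usage note are made up, not values from rrnDB.

```python
def correct_for_copy_number(otu_counts, copy_numbers):
    """Convert 16S read counts into copy-number-adjusted relative abundances.

    otu_counts:   {otu: read count}
    copy_numbers: {otu: estimated 16S copies per genome} (assumed known)
    """
    adjusted = {}
    for otu, count in otu_counts.items():
        copies = copy_numbers[otu]
        if copies <= 0:
            raise ValueError(f"copy number for {otu} must be positive")
        adjusted[otu] = count / copies
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {otu: value / total for otu, value in adjusted.items()}
```

For example, with hypothetical counts of 70 and 30 reads for two taxa carrying 7 and 1 copies respectively, the 70:30 read ratio becomes a 25:75 ratio of estimated genomes.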

Assumption #1
16S is an effective proxy for microbial diversity.
Alternative marker genes: cpn60, rpoB, …
Smaller reference databases!
Protein-coding genes!

Assumption #2
All 16S studies are created equal.
Effects of sequencing platform, V region, amplicon vs metagenomics
Tremblay et al. (2015) Front Microbiol

Assumption #3
Rarefaction is a good idea.
Example of statistics before and after rarefaction:
Loss of statistical power
Random subsampling can increase false-positive differences
Arbitrary minimum library size chosen for downsampling
Alternatives, e.g. negative binomial fitting (e.g., DESeq2)
McMurdie and Holmes (2014) PLoS Comp Biol
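
Rarefaction, as criticized on this slide, means randomly subsampling every sample down to a common library size. The sketch below shows the operation for one sample; the function name is hypothetical, and the chosen depth is exactly the arbitrary parameter the slide warns about.

```python
import random


def rarefy(counts, depth, seed=None):
    """Subsample `depth` reads without replacement from a vector of OTU counts."""
    total = sum(counts)
    if depth > total:
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)
    # Expand counts into one entry per read, labelled by OTU index.
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    rarefied = [0] * len(counts)
    for otu in rng.sample(pool, depth):
        rarefied[otu] += 1
    return rarefied
```

Every read beyond `depth` is discarded and the result changes with the random seed, which illustrates both the loss of statistical power and the added noise noted above.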

Assumption #4
16S OTUs describe ecologically cohesive units.
[Figure: distribution of sequence similarity (dashed line = OTU threshold); branch lengths]
Nguyen et al. (2016) npj Biofilms and Microbiomes

Assumption #4
Hall et al., in preparation
Same OTU, different temporal patterns

Assumption #4
Many alternatives exist, including Swarm: Mahé et al. (2015) PeerJ

Assumption #5
The 16S tree is the “Tree of Life”.
16S is limited for several reasons:
Limited resolving power
Subject to compositional bias
Subject to recombination and lateral transfer
Models typically applied to protein-coding genes do not make sense for
noncoding RNA

Moving On
ADVENTURES IN “MULTI-OMICS”

Multi-omics??
16S can profile the biodiversity of a microbial sample…
But we need the metagenome to shine a light on function…
The metatranscriptome tells us what is expressed under specific
conditions…
And the metaproteome can quantify the relative abundance of different
enzymes…
While the metametabolome focuses on the products of metabolism.
What do we really need?

Metagenomic / metatranscriptomic AMD analysis - Hua et al., ISME J (2015)
Draft genomes at MG-RAST

Differences in the microbiome between arsenic-exposed and control mice
16S taxonomic analysis + metametabolomics
[Figure panels: Taxonomy; Metabolic function]

Hands on!
LET’S MAKE SCIENCE HAPPEN

Workflow
1. Retrieve data
2. Cluster sequences
3. Taxonomic classification
4. Phylogenetic tree construction
5. OTU table creation
6. Downstream visualization / analysis

FIN
Presentations
http://www.slideshare.net/MickWatson/studying-the-microbiome
http://bioinformatics.ca/metagenomics2015module2pptx
