Recent Advancements in the
Statistical Analysis of Microbial
Metagenomic Sequence Data
Nicholas J. Ollberding, PhD
Associate Professor
Division of Biostatistics and Epidemiology
nicholas.ollberding@cchmc.org
Goals
• High-level introduction to several methodologic issues impacting
research into the role of the microbiome in human health
• Improvements in OTU clustering algorithms
• Log-ratio methodology to analyze compositional microbiome data
• Approaches for analyzing longitudinal studies and meta-analyses
• Topics important in improving applied microbiome research
• Touch on the implementation of several methods using open-source
software
Part 1. 16S rRNA Amplicon Sequencing
I. Important use cases and limitations of amplicon sequencing
II. Conceptual difference between OTUs vs. ASVs
III. Shallow shotgun sequencing as possible alternative to 16S
Why Consider Amplicon Sequencing?
• Hypotheses specific to community composition
• Low biomass environment or host-contaminated samples
• Compare results to large existing public datasets
• Cost-effective
Operational Taxonomic Units (OTUs)
• Operational definition used to classify groups of closely related
individuals by sequence similarity (Sokal and Sneath, 1963)
• NGS technologies have low, but non-zero, error rates
• Need an approach to resolve those errors when attempting to cluster
sequences for downstream analysis
• % similarity OTU clustering (97% OTU):
• Deal with errors by clustering reads together that share > 97% similarity
• Dereplicate and align, compute distance, cluster (agglomerative, greedy)
• Returns a subset of sequences (one per OTU) in which all pairs are < 97% identical
• 97% does NOT reflect species
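The greedy clustering step above can be sketched in a few lines of Python. This is a toy illustration, not the algorithm of any particular tool: it assumes reads are already dereplicated and aligned to equal length, measures identity as the fraction of matching positions, and treats >= 97% as the join rule. The helper `mutate` and the example reads are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def greedy_otu_clustering(abundances: dict, threshold: float = 0.97) -> dict:
    """Assign each unique read to the first centroid it matches at >= threshold."""
    otus = {}
    # Visit dereplicated reads from most to least abundant (ties broken by sequence).
    for seq in sorted(abundances, key=lambda s: (-abundances[s], s)):
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid].append(seq)
                break
        else:
            otus[seq] = [seq]  # no centroid is close enough: seq founds a new OTU
    return otus


def mutate(seq: str, positions: list) -> str:
    """Hypothetical helper: force a mismatch at each listed position."""
    chars = list(seq)
    for p in positions:
        chars[p] = "C" if chars[p] == "A" else "A"
    return "".join(chars)


base = "ACGT" * 25                       # 100 bp, abundant read
near = mutate(base, [0, 1])              # 98% identical to base
far = mutate(base, [0, 1, 2, 3, 4])      # 95% identical to base
otus = greedy_otu_clustering({base: 100, near: 3, far: 10})
print({centroid[:8]: len(members) for centroid, members in otus.items()})
```

The 98% read is absorbed into the abundant centroid while the 95% read founds its own OTU, which is the behavior the two "problems" on the next slide refer to.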
Operational Taxonomic Units (OTUs)
https://benjjneb.github.io/dada2/SMBS_DADA2.pdf
Problem 1: OTU obscures true biological variability
Problem 2: Generally have many rare clusters that share < 97% similarity
Amplicon Sequence Variants (ASVs)
• Unique sequence derived from amplicon sequencing
• Still have errors and need to resolve…
• Goal should be to report all correct biological sequences in the reads
• Denoising algorithms attempt to achieve this goal by:
• Dereplicating and aligning reads
• Assuming all reads start in a single partition
• Forming a new partition only if a sequence passes an abundance and skew
threshold, where:
• Abundance = the number of times the read is observed
• Skew = how different the read is from its potential parent sequence
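A rough Python sketch of the partitioning rule may help. It is an assumption-laden toy, not the UNOISE or DADA2 implementation: it follows the rule as described in the Edgar 2016 preprint as I understand it, where a read M is treated as an error of a more abundant centroid C when the abundance ratio a_M / a_C is at most beta(d) = 1 / 2^(alpha*d + 1), with d the number of differing bases. The values `alpha = 2` and `min_abundance = 8`, the Hamming distance on equal-length reads, and the example sequences are all illustrative choices.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length reads."""
    return sum(1 for x, y in zip(a, b) if x != y)


def max_error_ratio(d: int, alpha: float = 2.0) -> float:
    """beta(d): largest abundance ratio still explained as an error at distance d."""
    return 1.0 / 2 ** (alpha * d + 1)


def denoise(abundances: dict, min_abundance: int = 8, alpha: float = 2.0) -> dict:
    """Return {centroid: reads assigned}, visiting reads from most to least abundant."""
    partitions = {}
    for seq in sorted(abundances, key=lambda s: (-abundances[s], s)):
        count = abundances[seq]
        parent = None
        for centroid in partitions:
            d = hamming(seq, centroid)
            if count / abundances[centroid] <= max_error_ratio(d, alpha):
                parent = centroid
                break
        if parent is not None:
            partitions[parent] += count   # explained as a sequencing error of parent
        elif count >= min_abundance:
            partitions[seq] = count       # abundant and unexplained: new partition
        # otherwise: too rare to trust and not explained by any centroid, so dropped
    return partitions


parent = "AAAAAAAAAAAA"
true_variant = "CAAAAAAAAAAA"   # 1 difference, but too abundant to be an error
error_read = "AAAAAAAAAAAC"     # 1 difference, low abundance
rare_read = "AGGGGAAAAAAA"      # 4 differences, below the abundance floor
asvs = denoise({parent: 1000, true_variant: 300, error_read: 50, rare_read: 5})
print(asvs)
```

Here the low-abundance single-base variant is folded into its parent, the abundant single-base variant is kept as its own sequence, and the rare divergent read is discarded, which mirrors the limits on resolving rare variants noted on the following slide.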
Amplicon Sequence Variants (ASVs)
Schematic of UNOISE approach. Green dots are true sequences. Red dots are sequencing errors. The size of
the dot reflects the abundance. d reflects the number of divergent bases from X. B and E have low
divergence, but high abundance, so form new partitions. Others have insufficient abundance and divergence
to form new partitions. G is errantly not allowed to form new partition and highlights the limits of resolving
rare variants. E shows example of how false positives are formed. Edgar, bioRxiv, 2016.
• Improved sensitivity and
specificity
• Single nucleotide resolution
• More refined taxonomic classification
• Conceptually clearer
• Scales well to large datasets
https://benjjneb.github.io/dada2/index.html
16S Abundance and Amplification Bias
• 16S copy number variation
• Prokaryotic genomes contain ~1 to 10 copies of the 16S rRNA gene
• Strains with more copies become artificially abundant
• Primer mismatches
• PCR amplification degraded if template has mismatches with primers
• Order of magnitude suppression for each mismatched position
• Uneven mixing of degenerate primers
• Biases can occur due to unevenness in the primer mixture
• Small biases amplified by PCR (e.g. a 10% per-cycle bias after 20 cycles = 1.1^20 ≈ 6.7x)
• 16S read abundance does not correlate well with species abundance
R. Edgar, https://drive5.com/usearch/manual/amplification_bias
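The compounding arithmetic behind the "10% bias after 20 rounds" bullet can be checked directly. This is a minimal sketch that assumes the bias acts as a constant multiplicative advantage in every cycle; the function name is illustrative.

```python
def fold_bias(per_cycle_bias: float, cycles: int) -> float:
    """Fold over-representation after `cycles` rounds of a constant per-cycle bias."""
    return (1.0 + per_cycle_bias) ** cycles


bias_20 = fold_bias(0.10, 20)   # 10% advantage per cycle, 20 cycles
bias_30 = fold_bias(0.10, 30)   # same advantage with 10 more cycles
print(f"20 cycles: {bias_20:.2f}x, 30 cycles: {bias_30:.2f}x")
```

Because the effect is exponential in the number of cycles, even a modest per-cycle advantage dominates the final read counts.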
How to Interpret 16S Amplicon Abundances?
• Relative abundance of 16S rRNA gene copies…
• Given a specific hypervariable region and choice of primers
• PCR amplification, depth of sequencing, instrument error rate
• Error correction
• Cross-talk, contamination, wet-lab approaches, etc…
• Wise not to assume reflects true underlying relative abundance
• However, bias not expected to differ based on outcome
• So tests for differences across samples expected to remain valid
Shallow Shotgun Metagenomics
• Shallow sequencing (500k reads/sample) closely
approximated deep shotgun (2.5B reads) for:
• Alpha-diversity and beta-diversity
• Species and functional composition
• Biomarker discovery
• Illumina HiSeq with multiplexing
modifications
• $99 per sample at 2M reads/sample
• Limitations:
• Too shallow for assembly
• Too shallow to call strains
• Deeper sequencing needed for taxa < 0.05% relative abundance
• Database dependent (oral, fecal)
Hillmann, mSystems, 2018
Full-Length 16S rRNA Gene Sequencing
Near-perfect accuracy on Zymo and HMP mock communities
Callahan, bioRxiv, 2018
Part 2. CoDA for Microbiome Data
I. Why microbiome data are compositional and what are the implications
II. Approaches to account for compositional data
III. CoDA workflow for microbiome data
Interpreting Compositional Data
“There is a tendency in some compositional data analysts to expect too
much in their inferences from compositional data” – John Aitchison
Thought experiment:
- Planter that contains: 18 kg H2O, 12 kg soil, 6 kg seed (truth)
- Sample at night: xnight = (0.5, 0.33, 0.17)
- Sample in morning: xmorning = (0.67, 0.22, 0.11)
- Proportions consistent with:
- 36 kg H2O, 12 kg soil, 6 kg seed – it rained last night (H2O ↑)
- 18 kg H2O, 6 kg soil, 3 kg seed – wind blew out soil and seed (soil ↓, seed ↓)
- 27 kg H2O, 9 kg soil, 4.5 kg seed – wind and rain! (H2O ↑, soil ↓, seed ↓)
- Compositions provide relative information and cannot determine absolute changes without external
information
Aitchison, Mathematical Geology, 2005.
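The thought experiment can be verified numerically. The sketch below closes each absolute scenario from the slide to proportions and shows that all three morning scenarios give the same composition; the function name `closure` is my label for "divide by the total".

```python
def closure(parts: list) -> list:
    """Rescale absolute amounts so they sum to 1 (the compositional 'closure')."""
    total = sum(parts)
    return [p / total for p in parts]


# Absolute amounts in kg: (H2O, soil, seed)
night = [round(p, 2) for p in closure([18, 12, 6])]
scenarios = {
    "rain": [36, 12, 6],
    "wind": [18, 6, 3],
    "wind and rain": [27, 9, 4.5],
}
morning = {name: [round(p, 2) for p in closure(kg)] for name, kg in scenarios.items()}

print("night:", night)
for name, proportions in morning.items():
    print(f"morning ({name}):", proportions)
```

Three different physical histories collapse to one observed composition, so the proportions alone cannot say which component changed in absolute terms.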
Compositional Data in a Nutshell
• Vectors of non-negative components showing relative importance of
parts in a total
• Total sum is an artifact of the sampling procedure
• Scale invariant - doesn’t matter if components represented as proportions,
percents, ppm, etc.
• No individual component can be interpreted in isolation
• Carries no information on the absolute increment/decrement of components
• Sample space (set of possible values) is constrained by the unit sum
• Components are not free to vary individually
Tolosana, Compositional Data in a Nutshell, 2008.
Why Do we Care?
• Compositional data violate assumptions of standard statistical tests
• Unit sum constraint induces spurious correlations (Pearson 1897)
• Independent taxa can appear correlated when working with relative abundances
• Constraint distorts multivariate patterns of variability
• Ordinations distorted and can change for given sub-compositions
• Components are not independent
• Difficult to form meaningful hypotheses in regression/ANOVA for predictor
components in a mixture
• Inflated error rates for compositional outcomes
Xia et al., Statistical Analysis of Microbiome Data with R, 2018.
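The spurious-correlation point can be shown with a deliberately tiny example. In the sketch below the two hypothetical taxa have exactly zero Pearson correlation in absolute counts, yet after conversion to proportions in a two-part composition they are perfectly negatively correlated, purely because of the unit-sum constraint. The data values are made up for illustration.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Absolute abundances of two taxa across four samples (illustrative values).
taxon_a = [1, 2, 3, 4]
taxon_b = [2, 1, 1, 2]
raw_r = pearson(taxon_a, taxon_b)

# Relative abundances: each sample is divided by its own total.
totals = [a + b for a, b in zip(taxon_a, taxon_b)]
prop_a = [a / t for a, t in zip(taxon_a, totals)]
prop_b = [b / t for b, t in zip(taxon_b, totals)]
closed_r = pearson(prop_a, prop_b)

print(f"raw counts r = {raw_r:.3f}, proportions r = {closed_r:.3f}")
```

With more than two parts the induced correlation is weaker than -1 but still biased toward negative values, which is the effect Pearson described.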
NGS Returns Compositional Counts
• Microbiome experiment
• Extract DNA and generate library
• Sequence a random sample from library
• Total reads per sample equals:
• Total available reads
• Divided by number of samples multiplexed
• Counts carry only relative information
• Provide no information on absolute increment/decrement
• Precisely the definition of compositional data
• Need methods that account for this compositionality
Anahtar et al., J. Vis. Exp., 2016
Accounting for Compositional Data
• Standardize read counts using external spike-in to recover ratios of
absolute abundances
• Normalize counts to account for variation in sequencing depths
• Calculate “size factor” for each sample as an estimate of standardized library size
• Divide the read counts by size factor to produce normalized data
• Works well when most taxa are invariant to the condition under study
• Simply converting to proportions does not remove compositional constraint
• Log-ratio transformation to move from constrained to Euclidean space
• Change in coordinate system to remove compositional constraint
• Ratio of each taxon relative to some basis on the logarithmic scale
• CLR: ratio of each component to the geometric mean of all components on the log scale
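The size-factor bullets can be made concrete with one common choice of estimator, the median-of-ratios approach used by count-based tools. This is a sketch under that assumption, not necessarily the estimator the slide has in mind; it also assumes at least some taxa are observed in every sample so the logs are defined. Function names and counts are illustrative.

```python
from math import exp, log
from statistics import median


def size_factors(counts: list) -> list:
    """Median-of-ratios size factor per sample.

    counts: list of samples, each a list of per-taxon read counts in the same order.
    """
    n_taxa = len(counts[0])
    # Use only taxa observed in every sample so that every log is defined.
    usable = [j for j in range(n_taxa) if all(sample[j] > 0 for sample in counts)]
    # Log geometric mean of each usable taxon across samples (the reference sample).
    log_ref = {j: sum(log(s[j]) for s in counts) / len(counts) for j in usable}
    return [exp(median([log(s[j]) - log_ref[j] for j in usable])) for s in counts]


def normalize(counts: list) -> list:
    """Divide each sample's counts by its size factor."""
    return [[c / f for c in sample] for sample, f in zip(counts, size_factors(counts))]


samples = [
    [10, 20, 30, 0],    # shallow library
    [20, 40, 60, 5],    # same community sequenced twice as deeply, plus a rare taxon
]
factors = size_factors(samples)
normalized = normalize(samples)
print(factors)
print(normalized)
```

The estimator relies on most taxa being unchanged between conditions, which is the caveat stated in the bullet above.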
Log-Ratio Coordinates/Transforms
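• Additive log-ratio:
• $\mathrm{alr}(x) = \left[\ln(x_i / x_D)\right]_{i = 1, \dots, D-1}$
• Centered log-ratio:
• $\mathrm{clr}(x) = \left[\ln(x_i / g(x))\right]_{i = 1, \dots, D}$, where $g(x)$ is the geometric mean of all $D$ components
• Isometric log-ratio:
• Each ilr coordinate is a balance $\sqrt{\tfrac{rs}{r+s}}\,\ln\!\left(g(x_r) / g(x_s)\right)$ between a group of $r$ components and a group of $s$ components, where sequential partitioning of
components is used to form an orthonormal basis (e.g. pivot coordinates or
balances)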
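The three transforms can be written out in plain Python to make the definitions concrete. This is a minimal sketch assuming strictly positive inputs; `balance` computes a single ilr coordinate for a user-chosen split of components, and the index arguments are my own convention.

```python
from math import exp, log, sqrt


def geometric_mean(values: list) -> float:
    return exp(sum(log(v) for v in values) / len(values))


def alr(x: list) -> list:
    """Additive log-ratio: every component relative to the last one (x_D)."""
    return [log(xi / x[-1]) for xi in x[:-1]]


def clr(x: list) -> list:
    """Centered log-ratio: every component relative to the geometric mean g(x)."""
    g = geometric_mean(x)
    return [log(xi / g) for xi in x]


def balance(x: list, numerator: list, denominator: list) -> float:
    """One ilr coordinate: scaled log-ratio of the geometric means of two groups."""
    r, s = len(numerator), len(denominator)
    g_r = geometric_mean([x[i] for i in numerator])
    g_s = geometric_mean([x[i] for i in denominator])
    return sqrt(r * s / (r + s)) * log(g_r / g_s)


proportions = [0.5, 0.3, 0.2]
counts = [50, 30, 20]            # same composition on a different scale

print(alr(proportions))
print(clr(proportions))
print(balance(proportions, numerator=[0], denominator=[1, 2]))
```

All three give the same values for `proportions` and `counts`, which is the scale invariance described earlier; the clr values also sum to zero by construction.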
CoDA Microbiome Analysis Framework
• Common analyses have
compositional counterparts
• Mature R packages
• CoDA is now a recommended
approach in QIIME2
• Area of active development
• Numerous simulations suggest CoDA methods
perform as well as or better than
count-based normalization
Gloor et al., Frontiers in Microbiology, 2017.
Ordination Using Aitchison Distance
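# CLR-transform the counts
ps_clr <- microbiome::transform(ps_hmp4, "clr")
# Unconstrained RDA on CLR values is a PCA; Euclidean distance on CLR values is the Aitchison distance
ord_clr <- phyloseq::ordinate(ps_clr, "RDA")
# Plot the ordination, colored by body subsite
phyloseq::plot_ordination(ps_clr, ord_clr, type = "samples", color = "HMP_BODY_SUBSITE")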
Data from HMP1 (16S V3-V5 primer set) obtained using the HMP16SData package in R.
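For readers who want to see what the ordination is built on, the Aitchison distance is simply the Euclidean distance between CLR-transformed samples. The Python sketch below assumes strictly positive counts and uses made-up samples; it is independent of the phyloseq code above.

```python
from math import log, sqrt


def clr(x: list) -> list:
    """Centered log-ratio transform of a strictly positive vector."""
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]


def aitchison_distance(x: list, y: list) -> float:
    """Euclidean distance between the CLR coordinates of two samples."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


sample_1 = [10, 20, 70]
sample_1_deep = [100, 200, 700]   # same community, ten times the sequencing depth
sample_2 = [30, 30, 40]

d_shallow = aitchison_distance(sample_1, sample_2)
d_deep = aitchison_distance(sample_1_deep, sample_2)
print(d_shallow, d_deep, aitchison_distance(sample_1, sample_1_deep))
```

The distance to `sample_2` does not change with sequencing depth, and the two versions of sample 1 are at distance zero from each other.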
CoDA Wilcoxon Test: ALDEx2
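# Pull out the OTU table and condition vector
# (ALDEx2 expects taxa as rows; the transpose assumes ps_jm stores samples as rows)
x <- data.frame(t(phyloseq::otu_table(ps_jm)))
y <- as.character(phyloseq::sample_data(ps_jm)$group_2)
# Run ALDEx2 for a two-group comparison with the IQLR features as the log-ratio denominator
jm_aldex <- ALDEx2::aldex(x, y, denom = "iqlr")
# Effect-size (MW) plot, highlighting features significant by the Wilcoxon test
ALDEx2::aldex.plot(jm_aldex, type = "MW", test = "wilcox")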
Differentially abundant OTUs in murine stool according to foster genotype. BH-FDR < 0.1 in red.
OTUs < mean abundance (i.e. rare) shown in light grey. Fernandes, Microbiome, 2014.
Limitations of CoDA
• Logarithm of zero is undefined
• General approach is to impute small value
• Active area of research
• Log-ratio transformation cannot be used to identify exactly which
microbes are changing in DA testing
• Some still recommend count models for metagenomics:
• Large number of zeros distinctly problematic
• Zero-inflated models may better reflect true structural zeros
• Log-ratio normalization can be used to account for differences in library size
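The zero problem and the "impute a small value" workaround can be illustrated as follows. This sketch uses the simplest possible replacement, a uniform pseudocount, which is only one of several strategies and is chosen here purely for illustration; the counts and pseudocount values are arbitrary.

```python
from math import log


def clr(values: list) -> list:
    """Centered log-ratio; math.log raises ValueError if any value is zero."""
    logs = [log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]


def clr_with_pseudocount(counts: list, pseudocount: float = 0.5) -> list:
    """Add a small constant to every count before the CLR transform."""
    return clr([c + pseudocount for c in counts])


counts = [0, 12, 88]

try:
    clr(counts)
    zero_failed = False
except ValueError:
    zero_failed = True   # log(0) is undefined

clr_half = clr_with_pseudocount(counts, 0.5)
clr_one = clr_with_pseudocount(counts, 1.0)
print(zero_failed)
print(clr_half)
print(clr_one)
```

The transformed value of the zero-count taxon moves noticeably with the choice of pseudocount, which is one reason zero handling remains an open question.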
Normalization Via External Spike-Ins
• Spike-ins allow recovery of the ratios of absolute taxon
abundances
• Normalization analogous to additive log-ratio transformation
Stammler, Microbiome, 2016.
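The analogy with the additive log-ratio can be sketched as follows: use the spike-in as the reference component, so that differences in log-ratios between samples estimate fold changes in absolute abundance. The sketch assumes the same amount of spike-in was added to every sample; the taxon names and counts are hypothetical.

```python
from math import exp, log


def spike_in_log_ratios(counts: dict, spike: str) -> dict:
    """Log-ratio of every taxon to the spike-in (an alr with the spike as reference)."""
    reference = counts[spike]
    return {taxon: log(c / reference) for taxon, c in counts.items() if taxon != spike}


# Read counts per sample; "spike" is the exogenous organism added in equal amounts.
sample_a = {"taxon_x": 100, "taxon_y": 400, "spike": 50}
sample_b = {"taxon_x": 100, "taxon_y": 100, "spike": 25}

lr_a = spike_in_log_ratios(sample_a, "spike")
lr_b = spike_in_log_ratios(sample_b, "spike")

# Estimated fold change in absolute abundance, sample B relative to sample A.
fold_x = exp(lr_b["taxon_x"] - lr_a["taxon_x"])
fold_y = exp(lr_b["taxon_y"] - lr_a["taxon_y"])
print(fold_x, fold_y)
```

Taxon X has identical read counts in both samples, yet the halved spike-in signal in sample B implies its absolute abundance is about twice as high there.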
Part 3. Approaches for Analyzing
Longitudinal Metagenomic Data
I. Benefits of longitudinal study designs
II. Mixed-effects models for repeated measures data
III. Non-parametric alternatives
Longitudinal Microbiome Studies
• Critical to understanding change over time
• Many large, longitudinal studies due to low cost of 16S sequencing
• Benefits:
• Establish sequence of events
• Examine within-subject change
• Compare to previous time point
• Baseline
• Pre-intervention
Stool samples from 112 children. Pannaraj, JAMA Peds., 2017
Mixed-Effects Models
• Longitudinal studies have time-dependent properties:
• Inherent ordering of samples (i.e. time)
• Statistical dependences that are a function of time (i.e. within-subject correlation)
• Ignoring these properties can lead to wrong conclusions
• Averaging repeated measures or using data from single time points is inefficient
• Modeling repeated measures using mixed models allows for:
• Proper accounting of time-dependent correlations
• Proper estimation of standard errors
• Can conceptualize as level 1 and 2 sub-models
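One standard way to write the two sub-models is the random-intercept, random-slope growth model below. The notation is an assumption for illustration rather than taken from the slides: $y_{ij}$ is the outcome (e.g. alpha-diversity) for subject $i$ at occasion $j$, $t_{ij}$ is time, and $g_i$ is a subject-level group indicator.

```latex
\begin{aligned}
\text{Level 1 (within subject):}\quad
  & y_{ij} = \beta_{0i} + \beta_{1i}\, t_{ij} + \varepsilon_{ij},
  & \varepsilon_{ij} \sim N(0, \sigma^2) \\
\text{Level 2 (between subjects):}\quad
  & \beta_{0i} = \gamma_{00} + \gamma_{01}\, g_i + u_{0i} \\
  & \beta_{1i} = \gamma_{10} + \gamma_{11}\, g_i + u_{1i},
  & (u_{0i}, u_{1i})^{\top} \sim N(\mathbf{0}, \boldsymbol{\Sigma}_u)
\end{aligned}
```

Under this formulation $\gamma_{11}$ captures the group difference in the rate of change, and the random effects $u_{0i}$ and $u_{1i}$ induce the within-subject correlation.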
Modeling Alpha-Diversity: LMER
Raw data Model predictions
• LMER can be used to model linear trajectories and to test for group differences
• Functional form can be expanded to model change over time more flexibly
Outliers Abound in Microbiome Data!
• Outliers are the rule not the exception!
• Can bias regression parameters in
parametric models
• Can result in differences that are driven
by relatively few samples
• Primarily interested in identifying taxa
that “consistently” differ in abundance
• Truncating outliers common approach
• Robust regression or non-parametric
approaches may be useful
Oral microbiome samples following admission to ICU. Row 1 pathogens, Row 2 common oral, Row 3 common fecal.
Random Forest Regression
• Machine learning approaches to ID taxa
predictive of change over time
• Down weight outliers since goal is to identify
features that improve prediction over all samples
• Automatic generation of:
• Train and test sets
• Feature reduction
• Model performance
• Important features
• Implement various learners
• GBM, SVM, etc.
• Control charts to examine change
Bokulich, mSystems, 2018
Beta-Diversity: First Differences
• Calculates within-subject beta-diversity for each subject
• Baseline can be previous sample, T0, or other paired sample
Bokulich, mSystems, 2018
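The idea can be sketched in Python as the distance between each sample and the same subject's previous sample. This is an illustration of the concept only, not the q2-longitudinal code; Bray-Curtis is an arbitrary choice of beta-diversity metric here, and the subjects and abundances are made up.

```python
def bray_curtis(x: list, y: list) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def first_distances(samples: list) -> dict:
    """Distance from each sample to the same subject's previous sample.

    samples: list of (subject, time, abundances) tuples in any order.
    Returns {(subject, time): distance}; each subject's first visit has no entry.
    """
    by_subject = {}
    for subject, time, abundances in samples:
        by_subject.setdefault(subject, []).append((time, abundances))
    result = {}
    for subject, visits in by_subject.items():
        visits.sort(key=lambda visit: visit[0])
        for (_, previous), (time, current) in zip(visits, visits[1:]):
            result[(subject, time)] = bray_curtis(previous, current)
    return result


samples = [
    ("s1", 0, [10, 0]), ("s1", 1, [10, 0]), ("s1", 2, [0, 10]),
    ("s2", 1, [10, 0]), ("s2", 0, [5, 5]),
]
distances = first_distances(samples)
print(distances)
```

To use a fixed baseline such as T0 instead of the previous sample, each visit would be compared against `visits[0]` rather than its predecessor.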
Microbial Maturation Index
• Predicts subject age as a function of
microbiota composition
• Trained on a subset of control group
samples then predicts z-score for all
samples
Bokulich, mSystems, 2018
SplinectomeR
• Uses loess splines to model change over time
• Assume no distribution or functional form
• Null distributions generated by permutation
• Test whether two groups follow different
trajectories over time
• Permutes group labels
• Test for difference between groups at defined
intervals
• Tests for non-zero change in single population
• Permutes time points and compares to linear baseline
Shields-Cutler, Frontiers in Microbiology, 2018.
https://github.com/RRShieldsCutler/splinectomeR
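The label-permutation logic can be sketched without the spline machinery. The Python example below is a simplified stand-in, not splinectomeR itself: it skips the loess fit, assumes every subject is measured at the same time points, uses the mean absolute gap between group-mean trajectories as the test statistic, and enumerates every relabeling exactly rather than sampling permutations. The data are invented.

```python
from itertools import combinations


def mean(values: list) -> float:
    return sum(values) / len(values)


def trajectory_gap(group_a: list, group_b: list) -> float:
    """Mean absolute gap between the two group-mean trajectories across time points."""
    n_times = len(group_a[0])
    gaps = [
        abs(mean([s[t] for s in group_a]) - mean([s[t] for s in group_b]))
        for t in range(n_times)
    ]
    return mean(gaps)


def exact_permutation_p(group_a: list, group_b: list) -> float:
    """Share of all group relabelings with a gap at least as large as observed."""
    subjects = group_a + group_b
    observed = trajectory_gap(group_a, group_b)
    hits = total = 0
    for idx in combinations(range(len(subjects)), len(group_a)):
        chosen = set(idx)
        relabeled_a = [subjects[i] for i in idx]
        relabeled_b = [subjects[i] for i in range(len(subjects)) if i not in chosen]
        total += 1
        if trajectory_gap(relabeled_a, relabeled_b) >= observed - 1e-12:
            hits += 1
    return hits / total


# One trajectory per subject, measured at three shared time points.
flat_group = [[1, 1, 1], [1, 2, 1], [1, 1, 2], [1, 2, 2]]
rising_group = [[1, 3, 5], [1, 4, 6], [1, 4, 5], [1, 3, 6]]
p_value = exact_permutation_p(flat_group, rising_group)
print(round(p_value, 4))
```

With four subjects per group there are 70 possible relabelings, so the smallest attainable p-value is 2/70; larger studies would sample permutations at random instead of enumerating them.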
Longitudinal Clustering
• Approaches exist to cluster
longitudinal samples:
• K-means clustering
• Limited to one or two features
• Functional PCA
• Estimate smooth trajectories from
sparse longitudinal data
• Perform on taxon abundance table
• Obtain multiple PCs
• Suggesting groups with different
change trajectories
Longitudinal k-means clustering as implemented by the
kml R package. Mean trajectory of each partition colored.
Genolini, J. Stat. Software, 2015
Part 4. Microbiome Meta-Analyses
I. Qiita for metagenomic meta-analyses
Why Microbiome Meta-Analyses?
• Examine your samples in relation to other similar studies
• Compare to various control populations
• Assess reproducibility of results across studies
• Consistency of results
• Gain insight into reasons for discrepancies
• Compare results using standardized workflow
• More data should bring more insights…
[Figure labels: IBD, Recurrence, Risk]
Why Qiita?
• Database of >150,000 public samples (500k total)
• Metadata conforming to MIxS standards
• Standardized workflows using user-friendly web interface
• Analyses start from primary data (e.g. fastq)
• Redbiome to query metadata fields
• Other benefits:
• Computing on UCSD servers, ENA-EBI data deposition, projects kept private and/or made
public after embargo, use as GUI microbiome workflow for 16S and shotgun
metagenomics, QIIME2 analysis plug-ins, output processed data, other sample types,
metadata templates, …
Gonzalez, Nature Methods, 2018.
Publicly Accessible Samples
Geographic Diversity
Figure generated on 3-15-18
Qiita Metagenomic Read Processing Overview
Ordination for Example IBD Meta-Analysis
Cohort Diagnosis
https://qiita.ucsd.edu/analysis/description/15093/
Next steps: overlay your samples, restrict to those with calprotectin levels, test for differences in specific taxa, etc.
Quickly Generate Visualizations
Taxonomic Compositions
Core Taxa
Alpha-Diversity Correlations
Faith's PD
BMI
https://qiita.ucsd.edu/analysis/list/
Summary Points
• ASVs should replace OTUs for marker gene studies
• NGS returns compositional counts
• Compositional data analysis methods should be considered
• Impact of zeros on CoDA approaches active area of research
• Count normalization assumes the compositional constraint has been removed
• Longitudinal designs confer many benefits
• Many ongoing methodological developments
• Integrating samples across studies may help advance understanding
of the role of the microbiome in human health

More Related Content

Similar to DHC Microbiome Presentation 4-23-19.pptx

Metabarcoding QIIME2 workshop - Denoise
Metabarcoding QIIME2 workshop - DenoiseMetabarcoding QIIME2 workshop - Denoise
Metabarcoding QIIME2 workshop - DenoiseEvelien Jongepier
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment DesignYaoyu Wang
 
Multivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataMultivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataUC Davis
 
How to analyse bulk transcriptomic data using Deseq2
How to analyse bulk transcriptomic data using Deseq2How to analyse bulk transcriptomic data using Deseq2
How to analyse bulk transcriptomic data using Deseq2AdamCribbs1
 
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...Vahid Taslimitehrani
 
GLBIO/CCBC Metagenomics Workshop
GLBIO/CCBC Metagenomics WorkshopGLBIO/CCBC Metagenomics Workshop
GLBIO/CCBC Metagenomics WorkshopMorgan Langille
 
Giab agbt SVs_2019
Giab agbt SVs_2019Giab agbt SVs_2019
Giab agbt SVs_2019GenomeInABottle
 
RNA Seq Data Analysis
RNA Seq Data AnalysisRNA Seq Data Analysis
RNA Seq Data AnalysisRavi Gandham
 
Pitfalls of multivariate pattern analysis(MVPA), fMRI
Pitfalls of multivariate pattern analysis(MVPA), fMRI Pitfalls of multivariate pattern analysis(MVPA), fMRI
Pitfalls of multivariate pattern analysis(MVPA), fMRI Emily Yunha Shin
 
A short introduction to single-cell RNA-seq analyses
A short introduction to single-cell RNA-seq analysesA short introduction to single-cell RNA-seq analyses
A short introduction to single-cell RNA-seq analysestuxette
 
A data-intensive assessment of the species abundance distribution
A data-intensive assessment of the species abundance distributionA data-intensive assessment of the species abundance distribution
A data-intensive assessment of the species abundance distributionElita Baldridge
 
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedMicrobial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedJonathan Eisen
 
Genome wide association mapping
Genome wide association mappingGenome wide association mapping
Genome wide association mappingAvjinder (Avi) Kaler
 
CDAC 2018 Merico optimal scoring
CDAC 2018 Merico optimal scoringCDAC 2018 Merico optimal scoring
CDAC 2018 Merico optimal scoringMarco Antoniotti
 
Genome wide association studies---In genomics, a genome-wide association stud...
Genome wide association studies---In genomics, a genome-wide association stud...Genome wide association studies---In genomics, a genome-wide association stud...
Genome wide association studies---In genomics, a genome-wide association stud...DrAmitJoshi9
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016GenomeInABottle
 
Why Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is WrongWhy Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is WrongIddo
 

Similar to DHC Microbiome Presentation 4-23-19.pptx (20)

Metabarcoding QIIME2 workshop - Denoise
Metabarcoding QIIME2 workshop - DenoiseMetabarcoding QIIME2 workshop - Denoise
Metabarcoding QIIME2 workshop - Denoise
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment Design
 
Multivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataMultivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic Data
 
How to analyse bulk transcriptomic data using Deseq2
How to analyse bulk transcriptomic data using Deseq2How to analyse bulk transcriptomic data using Deseq2
How to analyse bulk transcriptomic data using Deseq2
 
Bioinformatica t4-alignments
Bioinformatica t4-alignmentsBioinformatica t4-alignments
Bioinformatica t4-alignments
 
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling ...
 
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...Activities at the Royal Society of Chemistry to gather, extract and analyze b...
Activities at the Royal Society of Chemistry to gather, extract and analyze b...
 
GLBIO/CCBC Metagenomics Workshop
GLBIO/CCBC Metagenomics WorkshopGLBIO/CCBC Metagenomics Workshop
GLBIO/CCBC Metagenomics Workshop
 
Giab agbt SVs_2019
Giab agbt SVs_2019Giab agbt SVs_2019
Giab agbt SVs_2019
 
RNA Seq Data Analysis
RNA Seq Data AnalysisRNA Seq Data Analysis
RNA Seq Data Analysis
 
Pitfalls of multivariate pattern analysis(MVPA), fMRI
Pitfalls of multivariate pattern analysis(MVPA), fMRI Pitfalls of multivariate pattern analysis(MVPA), fMRI
Pitfalls of multivariate pattern analysis(MVPA), fMRI
 
A short introduction to single-cell RNA-seq analyses
A short introduction to single-cell RNA-seq analysesA short introduction to single-cell RNA-seq analyses
A short introduction to single-cell RNA-seq analyses
 
A data-intensive assessment of the species abundance distribution
A data-intensive assessment of the species abundance distributionA data-intensive assessment of the species abundance distribution
A data-intensive assessment of the species abundance distribution
 
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedMicrobial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
 
Genome wide association mapping
Genome wide association mappingGenome wide association mapping
Genome wide association mapping
 
CDAC 2018 Merico optimal scoring
CDAC 2018 Merico optimal scoringCDAC 2018 Merico optimal scoring
CDAC 2018 Merico optimal scoring
 
Genome wide association studies---In genomics, a genome-wide association stud...
Genome wide association studies---In genomics, a genome-wide association stud...Genome wide association studies---In genomics, a genome-wide association stud...
Genome wide association studies---In genomics, a genome-wide association stud...
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
 
Why Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is WrongWhy Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is Wrong
 
Cufflinks
CufflinksCufflinks
Cufflinks
 

More from DivyanshGupta922023

(Public) FedCM BlinkOn 16 fedcm and privacy sandbox apis
(Public) FedCM BlinkOn 16 fedcm and privacy sandbox apis(Public) FedCM BlinkOn 16 fedcm and privacy sandbox apis
(Public) FedCM BlinkOn 16 fedcm and privacy sandbox apisDivyanshGupta922023
 
DevOps The Buzzword - everything about devops
DevOps The Buzzword - everything about devopsDevOps The Buzzword - everything about devops
DevOps The Buzzword - everything about devopsDivyanshGupta922023
 
Git Basics walkthough to all basic concept and commands of git
Git Basics walkthough to all basic concept and commands of gitGit Basics walkthough to all basic concept and commands of git
Git Basics walkthough to all basic concept and commands of gitDivyanshGupta922023
 
jquery summit presentation for large scale javascript applications
jquery summit  presentation for large scale javascript applicationsjquery summit  presentation for large scale javascript applications
jquery summit presentation for large scale javascript applicationsDivyanshGupta922023
 
Next.js - ReactPlayIO.pptx
Next.js - ReactPlayIO.pptxNext.js - ReactPlayIO.pptx
Next.js - ReactPlayIO.pptxDivyanshGupta922023
 
api-driven-development.pdf
api-driven-development.pdfapi-driven-development.pdf
api-driven-development.pdfDivyanshGupta922023
 
10-security-concepts-lightning-talk 1of2.pptx
10-security-concepts-lightning-talk 1of2.pptx10-security-concepts-lightning-talk 1of2.pptx
10-security-concepts-lightning-talk 1of2.pptxDivyanshGupta922023
 
Introduction to Directed Acyclic Graphs.pptx
Introduction to Directed Acyclic Graphs.pptxIntroduction to Directed Acyclic Graphs.pptx
Introduction to Directed Acyclic Graphs.pptxDivyanshGupta922023
 

More from DivyanshGupta922023 (17)

(Public) FedCM BlinkOn 16 fedcm and privacy sandbox apis
(Public) FedCM BlinkOn 16 fedcm and privacy sandbox apis(Public) FedCM BlinkOn 16 fedcm and privacy sandbox apis
(Public) FedCM BlinkOn 16 fedcm and privacy sandbox apis
 
DevOps The Buzzword - everything about devops
DevOps The Buzzword - everything about devopsDevOps The Buzzword - everything about devops
DevOps The Buzzword - everything about devops
 
Git Basics walkthough to all basic concept and commands of git
Git Basics walkthough to all basic concept and commands of gitGit Basics walkthough to all basic concept and commands of git
Git Basics walkthough to all basic concept and commands of git
 
jquery summit presentation for large scale javascript applications
jquery summit  presentation for large scale javascript applicationsjquery summit  presentation for large scale javascript applications
jquery summit presentation for large scale javascript applications
 
Next.js - ReactPlayIO.pptx
Next.js - ReactPlayIO.pptxNext.js - ReactPlayIO.pptx
Next.js - ReactPlayIO.pptx
 
Management+team.pptx
Management+team.pptxManagement+team.pptx
Management+team.pptx
 
developer-burnout.pdf
developer-burnout.pdfdeveloper-burnout.pdf
developer-burnout.pdf
 
AzureIntro.pptx
AzureIntro.pptxAzureIntro.pptx
AzureIntro.pptx
 
api-driven-development.pdf
api-driven-development.pdfapi-driven-development.pdf
api-driven-development.pdf
 
Internet of Things.pptx
Internet of Things.pptxInternet of Things.pptx
Internet of Things.pptx
 
Functional JS+ ES6.pptx
Functional JS+ ES6.pptxFunctional JS+ ES6.pptx
Functional JS+ ES6.pptx
 
AAAI19-Open.pptx
AAAI19-Open.pptxAAAI19-Open.pptx
AAAI19-Open.pptx
 
10-security-concepts-lightning-talk 1of2.pptx
10-security-concepts-lightning-talk 1of2.pptx10-security-concepts-lightning-talk 1of2.pptx
10-security-concepts-lightning-talk 1of2.pptx
 
Introduction to Directed Acyclic Graphs.pptx
Introduction to Directed Acyclic Graphs.pptxIntroduction to Directed Acyclic Graphs.pptx
Introduction to Directed Acyclic Graphs.pptx
 
ReactJS presentation.pptx
ReactJS presentation.pptxReactJS presentation.pptx
ReactJS presentation.pptx
 
01-React js Intro.pptx
01-React js Intro.pptx01-React js Intro.pptx
01-React js Intro.pptx
 
Nextjs13.pptx
Nextjs13.pptxNextjs13.pptx
Nextjs13.pptx
 

Recently uploaded

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 

DHC Microbiome Presentation 4-23-19.pptx

  • 1. Recent Advancements in the Statistical Analysis of Microbial Metagenomic Sequence Data Nicholas J. Ollberding, PhD Associate Professor Division of Biostatistics and Epidemiology nicholas.ollberding@cchmc.org
  • 2. Goals • High level introduction to several methodologic issues impacting research into the role of the microbiome on human health • Improvements in OTU clustering algorithms • Log-ratio methodology to analyze compositional microbiome data • Approaches for analyzing longitudinal studies and meta-analyses • Topics important in improving applied microbiome research • Touch on the implementation of several methods using open-source software
  • 3. Part 1. 16S rRNA Amplicon Sequencing I. Important use cases and limitations of amplicon sequencing II. Conceptual difference between OTUs vs. ASVs III. Shallow shotgun sequencing as possible alternative to 16S
  • 4. Why Consider Amplicon Sequencing? • Hypotheses specific to community composition • Low biomass environment or host-contaminated samples • Compare results to large existing public datasets • Cost-effective
  • 5. Operational Taxonomic Units (OTUs) • Operational definition used to classify groups of closely related individuals (sequence similarity) Sokal and Sneath 1963 • NGS technologies have low, but non-zero, error rates • Need an approach to resolve those errors when attempting to cluster sequences for downstream analysis • % similarity OTU clustering (97% OTU): • Deal with errors by clustering reads together that share > 97% similarity • Dereplicate and align, compute distance, cluster (agglomerative, greedy) • Returns subset sequences where all pairs of OTUs are <97% identical • 97% does NOT reflect species
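The greedy, abundance-ordered clustering step listed on this slide can be sketched in base R. This is a toy illustration under stated assumptions, not the algorithm of any particular tool: sequences are assumed to be pre-aligned and of equal length, identity is a simple position-wise match rate, the threshold is applied as >= 0.97, and the names `seq_identity` and `greedy_otus` are hypothetical.

```r
# Position-wise identity between two equal-length, pre-aligned sequences.
seq_identity <- function(a, b) {
  mean(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
}

# Greedy clustering: visit sequences from most to least abundant; each one
# joins the first centroid it matches at >= threshold, otherwise it becomes
# a new centroid.
greedy_otus <- function(seqs, abundance, threshold = 0.97) {
  centroids <- character(0)
  assignment <- integer(length(seqs))
  for (i in order(abundance, decreasing = TRUE)) {
    hit <- 0L
    for (k in seq_along(centroids)) {
      if (seq_identity(seqs[i], centroids[k]) >= threshold) {
        hit <- k
        break
      }
    }
    if (hit == 0L) {
      centroids <- c(centroids, seqs[i])
      hit <- length(centroids)
    }
    assignment[i] <- hit
  }
  list(centroids = centroids, assignment = assignment)
}

# Toy data: a 40-base reference, a 1-mismatch variant (97.5% identical),
# and a 3-mismatch variant (92.5% identical).
ref <- strrep("ACGT", 10)
one_err <- ref
substr(one_err, 5, 5) <- "C"
distant <- ref
substr(distant, 1, 4) <- "TTTT"

res <- greedy_otus(c(ref, one_err, distant), abundance = c(100, 3, 40))
res$assignment
```

In this toy case the single-mismatch read is absorbed into the reference OTU, which illustrates how a fixed identity threshold can hide a real single-nucleotide variant as easily as a sequencing error.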
  • 6. Operational Taxonomic Units (OTUs) https://benjjneb.github.io/dada2/SMBS_DADA2.pdf Problem 1: OTU obscures true biological variability Problem 2: Generally have many rare clusters that share < 97% similarity
  • 7. Amplicon Sequence Variants (ASVs) • Unique sequence derived from amplicon sequencing • Still have errors and need to resolve… • Goal should be to report all correct biological sequences in the reads • Denoising algorithms attempt to achieve this goal by: • Dereplicating and aligning reads • Assuming all reads start in a single partition • Forming new partitions only if a given sequence passes an abundance and skew threshold, where: • Abundance = the number of times the read is observed • Skew = how different the read is from its potential parent sequence
  • 8. Amplicon Sequence Variants (ASVs) Schematic of UNOISE approach. Green dots are true sequences. Red dots are sequencing errors. The size of the dot reflects the abundance. d reflects the number of divergent bases from X. B and E have low divergence, but high abundance, so form new partitions. Others have insufficient abundance and divergence to form new partitions. G is errantly not allowed to form new partition and highlights the limits of resolving rare variants. E shows example of how false positives are formed. Edgar, bioRxiv, 2016.
  • 9. • Improved sensitivity and specificity • Single nucleotide resolution • More refined taxonomic classification • Conceptually clearer • Scales well to large datasets https://benjjneb.github.io/dada2/index.html
  • 10. 16S Abundance and Amplification Bias • 16S copy number variation • Prokaryotic genomes contain ~1 to 10 copies of the 16S rRNA gene • Strains with more copies become artificially abundant • Primer mismatches • PCR amplification degraded if template has mismatches with primers • Order of magnitude suppression for each mismatched position • Uneven mixing of degenerate primers • Biases can occur due to unevenness in the primer mixture • Small biases amplified by PCR (e.g. 10% bias after 20 rounds = 1.1^20 ≈ 6.7x) • 16S read abundance does not correlate well with species abundance R. Edgar, https://drive5.com/usearch/manual/amplification_bias
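The compounding arithmetic on this slide, and the effect of copy number on read share, can be checked with a few lines of R. The taxa, cell counts, and copy numbers below are made-up values for illustration only.

```r
# A 10% per-cycle amplification advantage compounds multiplicatively.
per_cycle_bias <- 0.10
cycles <- 20
fold_bias <- (1 + per_cycle_bias)^cycles
round(fold_bias, 2)  # 6.73

# Copy-number effect: two taxa with equal cell counts but different numbers
# of 16S copies per genome (hypothetical values).
cells <- c(taxonA = 100, taxonB = 100)
copies_per_genome <- c(taxonA = 1, taxonB = 10)
gene_copies <- cells * copies_per_genome
gene_share <- gene_copies / sum(gene_copies)
round(gene_share, 3)
```

Although the two taxa are equally abundant as cells, taxonB contributes ten of every eleven 16S gene copies before any PCR bias is applied.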
  • 11. How to Interpret 16S Amplicon Abundances? • Relative abundance of 16S rRNA gene copies… • Given a specific hypervariable region and choice of primers • PCR amplification, depth of sequencing, instrument error rate • Error correction • Cross-talk, contamination, wet-lab approaches, etc… • Wise not to assume reflects true underlying relative abundance • However, bias not expected to differ based on outcome • So tests for differences across samples expected to remain valid
  • 12. Shallow Shotgun Metagenomics • Shallow (500k/sample) well-approximated deep shotgun (2.5B) • Alpha-diversity and beta-diversity • Species and functional composition • Biomarker discovery • Illumina HiSeq with multiplexing modifications • $99 sample at 2M reads/sample • Limitations: • Too shallow for assembly • Too shallow to call strains • Deeper seq. for taxa < 0.05% rel. abund. • Database dependent (oral, fecal) Hillmann, mSystems, 2018
  • 13. Full-Length 16S rRNA Gene Sequencing • Near-perfect accuracy on Zymo and HMP mock communities Callahan, bioRxiv, 2018
  • 14. Part 2. CoDA for Microbiome Data I. Why microbiome data are compositional and what are the implications II. Approaches to account for compositional data III. CoDA workflow for microbiome data
  • 15. Interpreting Compositional Data “There is a tendency in some compositional data analysts to expect too much in their inferences from compositional data” – John Aitchison Thought experiment: - Planter that contains: 18 kg H2O, 12 kg soil, 6 kg seed (truth) - Sample at night: xnight = (0.5, 0.33, 0.17) - Sample in morning: xmorning = (0.67, 0.22, 0.11) - Proportions consistent with: - 36 kg H2O, 12 kg soil, 6 kg seed – it rained last night (H2O ↑) - 18 kg H2O, 6 kg soil, 3 kg seed – wind blew out soil and seed (soil ↓, seed ↓) - 27 kg H2O, 9 kg soil, 4.5 kg seed – wind and rain! (H2O ↑, soil ↓, seed ↓) - Compositions provide relative information and cannot determine absolute changes without external information Aitchison, Mathematical Geology, 2005.
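The planter thought experiment can be reproduced numerically. The sketch below uses the masses from the slide; the helper name `closure` (rescale to unit sum) is an illustrative choice.

```r
# Rescale a vector of absolute amounts to proportions summing to 1.
closure <- function(x) x / sum(x)

night <- c(h2o = 18, soil = 12, seed = 6)    # true contents at night
rain  <- c(h2o = 36, soil = 12, seed = 6)    # it rained
wind  <- c(h2o = 18, soil = 6,  seed = 3)    # wind removed soil and seed
both  <- c(h2o = 27, soil = 9,  seed = 4.5)  # wind and rain

round(closure(night), 2)

# All three morning scenarios give the same observed composition.
morning <- sapply(list(rain = rain, wind = wind, both = both), closure)
round(morning, 2)
```

Each column of `morning` is the same composition, so the observed proportions alone cannot distinguish among the three absolute scenarios.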
  • 16. Compositional Data in a Nutshell • Vectors of non-negative components showing relative importance of parts in a total • Total sum is an artifact of the sampling procedure • Scale invariant - doesn’t matter if components represented as proportions, percents, ppm, etc. • No individual component can be interpreted in isolation • Carries no information on the absolute increment/decrement of components • Sample space (set of possible values) is constrained by the unit sum • Components are not free to vary individually Tolosana, Compositional Data in a Nutshell, 2008.
  • 17. Why Do We Care? • Compositional data violate assumptions of standard statistical tests • Unit sum constraint induces spurious correlations (Pearson 1897) • Independent taxa can appear correlated when working with relative abundances • Constraint distorts multivariate patterns of variability • Ordinations distorted and can change for given sub-compositions • Components are not independent • Difficult to form meaningful hypotheses in regression/ANOVA for predictor components in a mixture • Inflated error rates for compositional outcomes Xia et al., Statistical Analysis of Microbiome Data with R, 2018.
  • 18. NGS Returns Compositional Counts • Microbiome experiment • Extract DNA and generate library • Sequence a random sample from library • Total reads per sample equals: • Total available reads • Divided by number of samples multiplexed • Counts carry only relative information • Provide no information on absolute increment/decrement • Precisely the definition of compositional data • Need methods that account for this compositionality Anahtar et al., J. Vis. Exp., 2016
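The spurious-correlation point is easiest to see in the two-part case, where closure makes one proportion exactly one minus the other. The sketch below simulates two independent taxa; the distribution and its parameters are arbitrary assumptions chosen for illustration.

```r
set.seed(42)
n <- 200

# Absolute abundances of two taxa, drawn independently of each other.
a <- rlnorm(n, meanlog = 5, sdlog = 0.5)
b <- rlnorm(n, meanlog = 5, sdlog = 0.5)
cor_abs <- cor(a, b)  # near zero, up to sampling noise

# Closing to proportions forces p_a + p_b = 1.
total <- a + b
p_a <- a / total
p_b <- b / total
cor_rel <- cor(p_a, p_b)  # -1 up to floating point error

c(absolute = cor_abs, relative = cor_rel)
```

With more than two parts the induced correlation is no longer exactly -1, but the unit-sum constraint still pushes correlations among relative abundances away from those of the underlying absolute abundances.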
  • 19. Accounting for Compositional Data • Standardize read counts using external spike-in to recover ratios of absolute abundances • Normalize counts to account for variation in sequencing depths • Calculate “size factor” for each sample as an estimate of standardized library size • Divide the read counts by size factor to produce normalized data • Works well when most taxa are invariant to the condition under study • Simply converting to proportions does not remove compositional constraint • Log-ratio transformation to move from constrained to Euclidean space • Change in coordinate system to remove compositional constraint • Ratio of each taxon relative to some basis on the logarithmic scale • CLR: ratio of each component to the geometric mean of all components on log scale
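The size-factor steps on this slide can be sketched with a median-of-ratios estimator in base R. This is one common way to define a size factor and is used here as an assumption; the slide does not name a specific estimator, and the counts are made up.

```r
# Toy count table: taxa in rows, samples in columns. Sample s2 is simply
# s1 sequenced twice as deeply.
counts <- cbind(
  s1 = c(10, 20, 30, 40),
  s2 = c(20, 40, 60, 80),
  s3 = c(12, 18, 33, 37)
)
rownames(counts) <- paste0("taxon", 1:4)

# Median-of-ratios size factors: compare each sample with a per-taxon
# geometric-mean reference and take the median ratio across taxa.
log_geo_mean <- rowMeans(log(counts))
size_factor <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_mean))
})

# Divide each sample's counts by its size factor.
normalized <- sweep(counts, 2, size_factor, "/")
round(size_factor, 3)
```

This toy table has no zeros; real microbiome tables usually do, and the log step then needs zero handling before the same estimator can be applied.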
  • 20. Log-Ratio Coordinates/Transforms • Additive log-ratio: • $\mathrm{alr}(x) = [\ln(x_i / x_D)]$ • Centered log-ratio: • $\mathrm{clr}(x) = [\ln(x_i / g(x))]$ • Isometric log-ratio: • $\mathrm{ilr}(x) = \sqrt{\frac{rs}{r+s}} \, \ln\frac{g(x_r)}{g(x_s)}$ where $g(\cdot)$ is the geometric mean, $r$ and $s$ are the numbers of components in the two groups being contrasted, and sequential partitioning of components is used to form an orthonormal basis (e.g. pivot coordinates or balances)
  • 21. CoDA Microbiome Analysis Framework • Common analyses have compositional counterparts • Mature R packages • CoDA now recommended approach in QIIME2 • Area of active development • Numerous simulations suggest CoDA methods perform as well as or better than count-based normalization Gloor et al., Frontiers in Microbiology, 2017.
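The three transforms can be written directly from their definitions. The following base-R sketch assumes strictly positive inputs; the function names are illustrative, and `balance` computes a single ilr coordinate for one chosen split of the parts rather than a full ilr basis.

```r
# Geometric mean of a strictly positive vector.
gmean <- function(x) exp(mean(log(x)))

# Additive log-ratio: every part relative to the last part (D - 1 values).
alr <- function(x) log(x[-length(x)] / x[length(x)])

# Centered log-ratio: every part relative to the geometric mean of all parts.
clr <- function(x) log(x / gmean(x))

# One ilr coordinate (balance) contrasting parts r_idx against parts s_idx.
balance <- function(x, r_idx, s_idx) {
  r <- length(r_idx)
  s <- length(s_idx)
  sqrt(r * s / (r + s)) * log(gmean(x[r_idx]) / gmean(x[s_idx]))
}

x <- c(0.5, 0.33, 0.17)
alr(x)
clr(x)
balance(x, r_idx = 1, s_idx = 2:3)

# Scale invariance: proportions and percentages give the same coordinates.
all.equal(clr(x), clr(100 * x))
```

The clr values always sum to zero, which is why clr-transformed data are rank deficient by one dimension while alr and ilr return D - 1 coordinates.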
  • 22. Ordination Using Aitchison Distance • CLR transform: ps_clr <- microbiome::transform(ps_hmp4, "clr") • Calculate Aitchison distance: ord_clr <- phyloseq::ordinate(ps_clr, "RDA") • Plot ordination: phyloseq::plot_ordination(ps_clr, ord_clr, type="samples", color="HMP_BODY_SUBSITE") Data from the HMP 1 (16S V3-V5 primer set) obtained using the HMP16SData package in R.
  • 23. CoDA Wilcoxon Test: ALDEx2 • Pull out OTU table and condition: x <- data.frame(t(otu_table(ps_jm))); y <- as.character(sample_data(ps_jm)$group_2) • Run ALDEx2 for 2-group comparison: jm_aldex <- aldex(x, y, denom = "iqlr") • Effect size plot: aldex.plot(jm_aldex, type = "MW", test = "wilcox") Differentially abundant OTUs in murine stool according to foster genotype. BH-FDR < 0.1 in red. OTUs < mean abundance (i.e. rare) shown in light grey. Fernandes, Microbiome, 2014.
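For readers without the phyloseq objects from this slide, the same idea can be sketched in base R: the Aitchison distance is the Euclidean distance between CLR-transformed samples, so a PCA of the CLR matrix is an ordination in that geometry. The count matrix below is invented for illustration and contains no zeros.

```r
# CLR-transform each row (sample) of a strictly positive matrix.
clr_rows <- function(m) {
  logm <- log(m)
  logm - rowMeans(logm)
}

counts <- matrix(
  c(10, 20, 70,
    30, 30, 40,
     5, 15, 80),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("s", 1:3), paste0("taxon", 1:3))
)

# Aitchison distance = Euclidean distance on CLR coordinates.
aitch <- dist(clr_rows(counts))

# Rescaling each sample by a different sequencing depth leaves it unchanged.
scaled <- counts * c(1, 10, 100)  # row i is multiplied by the i-th factor
aitch_scaled <- dist(clr_rows(scaled))
all.equal(as.vector(aitch), as.vector(aitch_scaled))

# PCA on the CLR matrix gives the corresponding ordination.
pca <- prcomp(clr_rows(counts))
head(pca$x[, 1:2])
```

Because the distances are unaffected by per-sample scaling, differences in library size do not move samples in this ordination.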
  • 24. Limitations of CoDA • Logarithm of zero is undefined • General approach is to impute small value • Active area of research • Log-ratio transformation cannot be used to identify exactly which microbes are changing in DA testing • Some still recommend count models for metagenomics: • Large number of zeros distinctly problematic • Zero-inflated models may better reflect true structural zeros • Log-ratio normalization can be used to account for differences in library size
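The zero problem and the simplest imputation fix can be sketched as follows. The pseudocount of 0.5 is an arbitrary assumption for illustration; results can be sensitive to this choice, which is part of why zero handling remains an open question.

```r
counts <- c(taxonA = 120, taxonB = 0, taxonC = 30, taxonD = 0)

# CLR written as log counts centered on their mean.
clr <- function(x) log(x) - mean(log(x))

# The log of zero is -Inf, so the transform breaks on raw counts.
any(is.infinite(log(counts)))

# Simplest workaround: add a small pseudocount before transforming.
pseudo <- 0.5  # assumed value, not a recommendation
clr_imputed <- clr(counts + pseudo)
round(clr_imputed, 2)
```

The two zero taxa receive identical imputed values here, which treats every zero as the same small abundance regardless of whether it reflects true absence or insufficient sequencing depth.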
  • 25. Normalization Via External Spike-Ins • Spike-ins allow recovering ratios of the taxon absolute abundances • Normalization analogous to additive log-ratio transformation Stammler, Microbiome, 2016.
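The link between spike-in normalization and the additive log-ratio can be made concrete. In this sketch the read counts and the amount of spike-in added (`spike_added`) are hypothetical values; real protocols also have to account for extraction efficiency and copy number.

```r
# Read counts for two taxa plus an external spike-in organism.
counts <- c(taxonA = 400, taxonB = 100, spike = 50)
spike_added <- 1e6  # assumed known amount of spike-in added to the sample

# Scale taxon reads by the spike-in reads, then by the known spike amount.
taxa <- counts[c("taxonA", "taxonB")]
est_abs <- taxa / counts["spike"] * spike_added

# On the log scale this is the alr with the spike-in as reference,
# shifted by a known constant.
alr_spike <- log(taxa / counts["spike"])
all.equal(log(est_abs), alr_spike + log(spike_added))
```

Because every sample is divided by a reference whose absolute amount is known, ratios between taxa, and between samples for the same taxon, are interpretable on an absolute scale.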
  • 26. Part III. Approaches for Analyzing Longitudinal Metagenomic Data I. Benefits of longitudinal study designs II. Mixed-effects models for repeated measures data III. Non-parametric alternatives
  • 27. Longitudinal Microbiome Studies • Critical to understanding change over time • Many large, longitudinal studies due to low cost of 16S sequencing • Benefits: • Establish sequence of events • Examine within-subject change • Compare to previous time point • Baseline • Pre-intervention Stool samples on 112 children. Pannaraj, JAMA Peds., 2017
  • 28. Mixed-Effects Models • Longitudinal studies have time-dependent properties: • Inherent ordering of samples (i.e. time) • Statistical dependences that are a function of time (i.e. within-subject correlation) • Ignoring these properties can lead to wrong conclusions • Averaging repeated measures or using data from single time points is inefficient • Modeling repeated measures using mixed models allows for: • Proper accounting of time-dependent correlations • Proper estimation of standard errors • Can conceptualize as level 1 and 2 sub-models
  • 29. Modeling Alpha-Diversity: LMER (Figure panels: raw data; model predictions) • LMER can be used to model linear trajectories and to test for group differences • Functional form can be expanded to model change over time more flexibly
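The level 1 and level 2 sub-models can be written out for a linear growth model with a single group covariate. The notation is a common textbook convention and an assumption here, since the slide does not give the equations: $Y_{ij}$ is the outcome (e.g. alpha-diversity) for subject $i$ at occasion $j$, $t_{ij}$ is time, and $G_i$ is a group indicator.

```latex
% Level 1: within-subject change over time
Y_{ij} = \pi_{0i} + \pi_{1i}\, t_{ij} + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma^2_{\varepsilon})

% Level 2: between-subject differences in intercepts and slopes
\pi_{0i} = \gamma_{00} + \gamma_{01}\, G_i + \zeta_{0i}
\pi_{1i} = \gamma_{10} + \gamma_{11}\, G_i + \zeta_{1i}

% Composite model after substitution
Y_{ij} = \gamma_{00} + \gamma_{01} G_i + \gamma_{10} t_{ij}
       + \gamma_{11} G_i t_{ij}
       + \zeta_{0i} + \zeta_{1i} t_{ij} + \varepsilon_{ij}
```

In lme4 formula notation the composite model corresponds to `y ~ time * group + (1 + time | subject)`, where $\gamma_{11}$ (the time-by-group interaction) tests whether the groups follow different linear trajectories.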
  • 30. Outliers Abound in Microbiome Data! • Outliers are the rule not the exception! • Can bias regression parameters in parametric models • Can result in differences that are driven by relatively few samples • Primarily interested in identifying taxa that “consistently” differ in abundance • Truncating outliers common approach • Robust regression or non-parametric approaches may be useful Oral microbiome samples following admission to ICU. Row 1 pathogens, Row 2 common oral, Row 3 common fecal.
  • 31. Random Forest Regression • Machine learning approaches to ID taxa predictive of change over time • Down weight outliers since goal is to identify features that improve prediction over all samples • Automatic generation of: • Train and test sets • Feature reduction • Model performance • Important features • Implement various learners • GBM, SVM, etc. • Control charts to examine change Bokulich, mSystems, 2018
  • 32. Beta-Diversity: First Differences • Calculates within-subject beta-diversity for each subject • Baseline can be previous sample, T0, or other paired sample Bokulich, mSystems, 2018
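A base-R sketch of the first-differences idea, using the previous sample as the baseline. The data frame, its column names, and the choice of Aitchison distance as the beta-diversity metric are illustrative assumptions; this is not the implementation from the cited software.

```r
clr <- function(x) log(x) - mean(log(x))

# Hypothetical long-format data: one row per subject and time point.
df <- data.frame(
  subject = rep(c("s1", "s2"), each = 3),
  time    = rep(0:2, times = 2),
  taxon1  = c(10, 20, 40, 30, 30, 30),
  taxon2  = c(50, 40, 30, 30, 35, 30),
  taxon3  = c(40, 40, 30, 40, 35, 40)
)
taxa <- c("taxon1", "taxon2", "taxon3")

# For each subject, distance between each sample and the preceding one.
first_diff <- do.call(rbind, lapply(split(df, df$subject), function(d) {
  d <- d[order(d$time), ]
  z <- t(apply(as.matrix(d[, taxa]), 1, clr))  # samples in rows
  later   <- z[-1, , drop = FALSE]
  earlier <- z[-nrow(z), , drop = FALSE]
  data.frame(
    subject  = d$subject[-1],
    time     = d$time[-1],
    distance = sqrt(rowSums((later - earlier)^2))
  )
}))
first_diff
```

The resulting one-row-per-interval table can then be used as the outcome in a longitudinal model or compared between groups; switching the baseline to T0 only requires replacing `earlier` with the first row repeated.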
  • 33. Microbial Maturation Index • Predicts subject age as a function of microbiota composition • Trained on a subset of control group samples then predicts z-score for all samples Bokulich, mSystems, 2018
  • 34. SplinectomeR • Uses loess splines to model change over time • Assume no distribution or functional form • Null distributions generated by permutation • Test whether two groups follow different trajectories over time • Permutes group labels • Test for difference between groups at defined intervals • Tests for non-zero change in single population • Permutes time points and compares to linear baseline Shields-Cutler, Frontiers in Microbiology, 2018. https://github.com/RRShieldsCutler/splinectomeR
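The group-label permutation test described on this slide can be sketched with base-R `loess`. This is a simplified illustration in the spirit of the approach, not the splinectomeR implementation: the simulated data, the area-between-curves statistic, the evaluation grid, and the number of permutations are all assumptions.

```r
set.seed(7)

# Simulated study: 20 subjects, 6 visits each, group B increases over time.
d <- expand.grid(id = 1:20, visit = 0:5)
d$group <- ifelse(d$id <= 10, "A", "B")
d$time <- d$visit + runif(nrow(d), -0.2, 0.2)  # jitter avoids tied times
d$y <- 2 + 0.5 * d$time * (d$group == "B") + rnorm(nrow(d), sd = 0.5)

# Evaluate curves on a grid inside every group's observed time range.
grid <- data.frame(time = seq(0.5, 4.5, length.out = 50))

# Test statistic: approximate area between the two group-level loess curves.
curve_gap <- function(dat) {
  fit_a <- loess(y ~ time, data = dat[dat$group == "A", ])
  fit_b <- loess(y ~ time, data = dat[dat$group == "B", ])
  gap <- abs(predict(fit_a, newdata = grid) - predict(fit_b, newdata = grid))
  mean(gap) * diff(range(grid$time))
}

observed <- curve_gap(d)

# Null distribution: shuffle group labels across subjects, keeping each
# subject's time series intact.
ids <- unique(d$id)
labels <- d$group[match(ids, d$id)]
null <- replicate(199, {
  shuffled <- setNames(sample(labels), ids)
  dd <- d
  dd$group <- unname(shuffled[as.character(dd$id)])
  curve_gap(dd)
})

p_value <- (1 + sum(null >= observed)) / (1 + length(null))
p_value
```

Permuting at the subject level rather than the sample level preserves the within-subject correlation under the null, which is the main reason to prefer it for repeated measures.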
  • 35. Longitudinal Clustering • Approaches exist to cluster longitudinal samples: • K-means clustering • Limited to one or two features • Functional PCA • Estimate smooth trajectories from sparse longitudinal data • Perform on taxon abundance table • Obtain multiple PCs • Suggesting groups with different change trajectories Longitudinal k-means clustering as implemented by the kml R package. Mean trajectory of each partition colored. Genolini, J. Stat. Software, 2015
  • 36. Part IV. Microbiome Meta-Analyses I. Qiita for metagenomic meta-analyses
  • 37. Why Microbiome Meta-Analyses? • Examine your samples in relation to other similar studies • Compare to various control populations • Assess reproducibility of results across studies • Consistency of results • Gain insight into reasons for discrepancies • Compare results using standardized workflow • More data should bring more insights… IBD Recurrence Risk
  • 38. Why Qiita? • Database of >150,000 public samples (500k total) • Metadata conforming to MIxS standards • Standardized workflows using user-friendly web interface • Analyses start from primary data (e.g. fastq) • Redbiome to query metadata fields • Other benefits: • Computing on UCSD servers, ENA-EBI data deposition, projects kept private and/or made public after embargo, use as GUI microbiome workflow for 16S and shotgun metagenomics, QIIME2 analysis plug-ins, output processed data, other sample types, metadata templates, … Gonzalez, Nature Methods, 2018.
  • 41. Qiita Metagenomic Read Processing Overview
  • 42. Ordination for Example IBD Meta-Analysis Cohort Diagnosis https://qiita.ucsd.edu/analysis/description/15093/ Next steps: overlay your samples, restrict to those with calprotectin levels, test for differences in specific taxa, etc..
  • 43. Quickly Generate Visualizations Taxonomic Compositions Core Taxa Alpha-Diversity Correlations Faith's PD BMI https://qiita.ucsd.edu/analysis/list/
  • 44. Summary Points • ASVs should replace OTUs for marker gene studies • NGS returns compositional counts • Compositional data analysis methods should be considered • Impact of zeros on CoDA approaches is an active area of research • Normalization assumes the compositional constraint has been removed • Longitudinal designs confer many benefits • Many ongoing methodological developments • Integrating samples across studies may help advance understanding of the role of the microbiome in human health

Editor's Notes

  1. Mention that PacBio and DADA2 have shown good performance for the full 16S gene region
  2. This is a great slide by Ben Callahan at North Carolina State University that gives an excellent conceptual overview of the OTU clustering process…
  3. D = stool E = oral (gingival plaque)
  4. Have to be interpreted relative to basis
  5. Compositional data exist in a subspace of Euclidean geometry referred to as the simplex. This has important consequences.