2. Outline
• Background : Importance of gut microbiota
• Measurement : Metagenomic sequencing
• Preprocessing : From sequences to OTUs by clustering
• Data analysis methods
• Future prospects
3. Importance of gut microbiota
Many diseases have been shown to be related to the gut microbiota:
• Autoimmune diseases of the eye
• Inflammatory bowel diseases
• Diabetes & obesity
Population : 10^13
Species : 10^3
Wu, et al., 2011, Science
Horai, et al., 2015, Immunity
Sender, et al., 2016, Cell
Sommer, et al., 2013, Nat. Rev. Microbiol.
Fig : http://www.irasutoya.com/
4. Measurement : Metagenomic sequencing
Q. How can we know the population of microbiota?
A. Count gene sequences varying between species.
Environment (microbes) -> Genome -> 16S rRNA gene (varies between species) -> Read by NGS (Next-generation sequencing) -> Count
Example reads: ACGTGG…, AAGTGG…, AAGTCG…
What species?
Just an error or different species?
Fig. sources:
*1 https://www.slideshare.net/AshokSharma53/16s-classifier
*2 http://togotv.dbcls.jp/ja/pics.html
5. Preprocessing : from sequences to OTUs by clustering
Q. How can we define “species” in microbes?
A. Currently no widely accepted concept.
Instead, use OTUs (Operational Taxonomic Units) [Franzen, et al.,
OTUs : clustered units with <3% dissimilarity of 16S rRNA
2 clustering approaches:
• Heuristic approach
  Trend : ~2010s
  Computational cost : Light
  Literature : Li, et al., 2006; Edgar, 2010; Ghodsi, et al., 2011
• Hierarchical approach
  Trend : 2010s~
  Computational cost : Heavy
  Literature : Sun, et al., 2009; Matias, et al., 2014
6. Why was the heuristic approach common?
2 main problems for OTU clustering:
1. Large size of sequence reads
・100K reads/shot -> 100GB of distance matrix
・Hierarchical clustering algorithm : O(n^2)
2. Computational cost of calculating distance between sequences
・10^3 bp/seq
・Sequence alignment : O(mn)
Before alignment:
s1: acctggtaaa
s2: acatgcgtata
After alignment:
s1: acctg-gtaaa
s2: acatgcgtata
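The O(mn) cost of alignment can be illustrated with the standard dynamic-programming edit distance. This is a minimal sketch, not the scoring scheme of any particular OTU tool: the function names `edit_distance` and `dissimilarity`, the unit costs for mismatches and gaps, and the normalization by the longer sequence length are all assumptions made for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance; fills an (m+1) x (n+1) table row by row."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))          # distance from the empty prefix of s1
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # gap in s2
                          curr[j - 1] + 1,     # gap in s1
                          prev[j - 1] + cost)  # match or mismatch
        prev = curr
    return prev[n]


def dissimilarity(s1: str, s2: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    return edit_distance(s1, s2) / max(len(s1), len(s2))


print(edit_distance("acctggtaaa", "acatgcgtata"))  # 3
```

The two nested loops visit every cell of the table once, which is where the O(mn) term per sequence pair comes from.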
7. How does the heuristic approach alleviate computational cost? (1)
1. Large size of sequence reads
-> Greedy algorithm
ex) Start clustering from the longest sequence (Li, et al., 2006)
1. Sort by length
   ATGCGTGGCAG
   TGGCTGGACA
   ATGGCATGG
   ︙
2. Pick the longest as seed
   ATGCGTGGCAG
3. Calculate distance & join into cluster
4. Iterate
Problem: the resulting clusters can be wrong depending on the clustering order (Franzen, et al., 2
[Figure: seq.i is compared with seed1 and seed2 using d(seed1, seq.i), d(seed2, seq.i) and the threshold; seq.i ends up in the seed1 cluster]
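The four steps above can be sketched as follows. This is a simplified illustration of greedy, seed-based clustering in general, not the implementation of Li, et al.: the helper names, the edit-distance dissimilarity, and the toy threshold of 0.2 (chosen because the example sequences are very short; the slides use 3% for 16S rRNA) are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    return edit_distance(a, b) / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Return a list of (seed, members) pairs."""
    clusters = []
    # 1. Sort by length, longest first.
    for s in sorted(seqs, key=len, reverse=True):
        # 3. Join the first existing seed that is close enough.
        for seed, members in clusters:
            if dissimilarity(seed, s) <= threshold:
                members.append(s)
                break
        else:
            # 2. No seed is close enough: s becomes a new seed.
            clusters.append((s, [s]))
    return clusters                     # 4. The loop iterates over all reads.


reads = ["ACGTACGTAC", "ACGTACGTA", "TTTTTTTT", "TTTTTTT"]
for seed, members in greedy_cluster(reads, threshold=0.2):
    print(seed, members)
```

Because each read joins the first seed within the threshold rather than the closest one, the result depends on processing order, which is the problem noted on the slide.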
8. How does the heuristic approach alleviate computational cost? (2)
2. Computational cost of calculating distance between sequences
-> Filtering (Li, et al., 2006) -> Prefix tree (Ghodsi, et al., 20
Filtering:
ATCTGGCTAGCACCTGAGTTGA
1) Find runs of characters that match completely (efficiently, by lookup table)
2) Given the sequence length, if no matches are found, this gives a lower bound on the number of mismatches and therefore an upper bound on the matching rate
-> Filter by threshold
Prefix tree:
1) Create a prefix tree for the sequences AAT…, AAG…, AT…, …
2) Reuse the DP matrix for the next leaf: after calculating the distance from the query to AAT…, the part of the matrix for the shared prefix AA is the same for AAG…
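The DP-reuse idea can be sketched without building an explicit tree: visiting the target sequences in sorted order is equivalent to a depth-first walk of their prefix tree, so rows of the DP matrix computed for a shared prefix can be kept. This is an illustrative sketch under that assumption, with a hypothetical function name and unit edit costs; it is not the algorithm of Ghodsi, et al. as published.

```python
def distances_with_prefix_reuse(query, targets):
    """Edit distance from query to each target, reusing DP rows of shared prefixes."""
    # rows[i] holds the DP row for the first i characters of the current target.
    rows = [list(range(len(query) + 1))]
    prev = ""
    result = {}
    for t in sorted(targets):           # sorted order = depth-first tree walk
        # Length of the prefix shared with the previously processed target.
        k = 0
        while k < min(len(prev), len(t)) and prev[k] == t[k]:
            k += 1
        del rows[k + 1:]                # keep only rows for the shared prefix
        for i in range(k, len(t)):      # compute rows for the new suffix only
            last = rows[-1]
            row = [i + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if t[i] == query[j - 1] else 1
                row.append(min(last[j] + 1, row[j - 1] + 1, last[j - 1] + cost))
            rows.append(row)
        result[t] = rows[-1][-1]
        prev = t
    return result


print(distances_with_prefix_reuse("AAG", ["AAT", "AAG", "AT"]))
```

For the three targets on the slide, only five DP rows are computed instead of eight, because the rows for the prefixes "AA" and "A" are reused.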
9. Hierarchical Clustering for large data
Sun, et al., 2009, Nucleic Acids Res.
Problem : Cannot pass a full distance matrix as input due to its size
Method : Sparse sorted matrix & on-the-fly processing
Algorithm (complete linkage clustering)
0. Create a sparse sorted distance matrix, skipping too dissimilar pairs by filtering
Sorted list (index of seq. & distance):
si    1   1   3   3   2   4
sj    2   3   4   5   3   5
dist. 0.1 0.2 0.3 0.4 0.5 0.6
[Figure: Steps 1 to 6 process the sorted pairs one at a time and update the links among sequences 1 to 5]
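One way to process such a sorted pair list on the fly is to count, for every pair of current clusters, how many cross-cluster links have been seen so far, and to merge two clusters once all of their links have appeared. The sketch below follows that idea; it is an assumption about how complete linkage can be realized on a sparse sorted list, not a transcription of the code of Sun, et al., and the function name and return format are hypothetical.

```python
def complete_linkage(sorted_pairs, cutoff):
    """sorted_pairs: (si, sj, dist) tuples in ascending order of dist."""
    cluster_of = {}   # sequence id -> cluster id
    members = {}      # cluster id -> list of sequence ids
    seen = {}         # frozenset({a, b}) -> number of links seen between a and b
    merges = []
    for i, j, d in sorted_pairs:
        if d > cutoff:
            break
        for s in (i, j):
            if s not in cluster_of:
                cluster_of[s] = s
                members[s] = [s]
        a, b = cluster_of[i], cluster_of[j]
        if a == b:
            continue
        key = frozenset((a, b))
        seen[key] = seen.get(key, 0) + 1
        # Complete linkage: merge only when every cross pair has been seen.
        if seen[key] < len(members[a]) * len(members[b]):
            continue
        del seen[key]
        # Links from b to any other cluster c now count as links from a to c.
        for key_bc in [k for k in seen if b in k]:
            (c,) = key_bc - {b}
            key_ac = frozenset((a, c))
            seen[key_ac] = seen.get(key_ac, 0) + seen.pop(key_bc)
        for s in members[b]:
            cluster_of[s] = a
        members[a].extend(members.pop(b))
        merges.append((sorted(members[a]), d))
    return list(members.values()), merges


pairs = [(1, 2, 0.1), (1, 3, 0.2), (3, 4, 0.3),
         (3, 5, 0.4), (2, 3, 0.5), (4, 5, 0.6)]
clusters, merges = complete_linkage(pairs, cutoff=0.6)
print(clusters)
print(merges)
```

Pairs that were filtered out as too dissimilar never appear in the list, so clusters that would need them are never merged, which matches treating their distance as above the cutoff. Sequences that appear in no pair are not returned by this sketch.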
10. Outline
• Background : Importance of gut microbiota
• Measurement : metagenomic sequencing
• Preprocessing : from sequences to OTUs by clustering
• Data analysis methods
• Future prospects
11. SGSL : Sparse Group-Subgroup LASSO
Garcia, et al., 2014, Bioinformatics
Q. Which factors in the microbiota are key to the objective variable?
Observed data: samples #1, #2, #3, … with objective variable y and microbial variables x1, x2, x3, …
Phylogenetic tree over x1, x2, x3, … (Phylum? Family? Genus?)
→ find the subset of x in the tree structure that is correlated to y
12. Sparse regression by LASSO
Tibshirani, 1996, J. R. Statist. Soc. B
Q. Estimate the values of parameters with sparse X in linear regression
[Equation: linear regression model; legend: # of data, data, objective variable, explanatory variables, noise]
A. LASSO (Least Absolute Shrinkage and Selection Operator)
[Equation: residual sum of squares plus a penalty term with a parameter]
[Figure: constraint region, giving a sparse result]
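For readers who want to see the sparsity effect numerically, here is a minimal coordinate-descent sketch of LASSO. The notation is an assumption, since the slide's equations are not available here: the model is taken as y = Xβ + ε and the objective as (1/2)‖y − Xβ‖² + λ‖β‖₁, which is the penalized form of Tibshirani's constrained problem. The function names and the scaling of the squared-error term are illustrative choices.

```python
def soft_threshold(r: float, lam: float) -> float:
    """Shrink r toward zero by lam; values within [-lam, lam] become exactly 0."""
    if r > lam:
        return r - lam
    if r < -lam:
        return r + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(n))
            if z == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            )
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Orthogonal toy design: the small coefficient is set exactly to zero.
print(lasso_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0))  # [2.0, 0.0]
```

The soft-thresholding step is what produces exact zeros, which corresponds to the corners of the constraint region in the usual geometric picture.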
13. From LASSO to SGSL
• Lasso (Tibshirani, 1996) : Sparse
• Group Lasso (Yuan & Lin, 2006) : Group
• Sparse-group Lasso (Simon, et al., 2012) : Sparse, Group
• Sparse group subgroup Lasso (Garcia, et al., 2014) : Sparse, Group, Subgroup
14. Estimation of correlation from relative population
Q. How can we infer the correlation between absolute populations from relative populations?
[Figure: scatter plots of the unobservable absolute population and the observable relative population]
Data tables: x1, x2, x3, … and n1, n2, n3, … for samples #1, #2, #3, …
15. Estimation of sparse correlation from relative population
Friedman & Alm, 2012, PLOS Comp. Biol.
[Equation: derivation in which one term becomes negligible if the correlations are sparse]
Solved!
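The core approximation of Friedman & Alm can be sketched as follows. For relative populations x_i, the log-ratio variance t_ij = Var[log(x_i / x_j)] equals ω_i² + ω_j² − 2ρ_ij ω_i ω_j, where ω_i² is the variance of the log absolute population and ρ_ij the correlation to be estimated. If correlations are sparse, the sum of the ρ terms is dropped, the ω_i² are solved in closed form, and ρ_ij follows. The code below implements only this basic step; the function names are hypothetical, and the published method adds iteration and other refinements that are omitted here.

```python
import math


def log_ratio_variances(samples):
    """samples: list of samples, each a list of positive relative populations."""
    n, D = len(samples), len(samples[0])
    T = [[0.0] * D for _ in range(D)]
    for i in range(D):
        for j in range(i + 1, D):
            r = [math.log(s[i] / s[j]) for s in samples]
            mean = sum(r) / n
            T[i][j] = T[j][i] = sum((v - mean) ** 2 for v in r) / (n - 1)
    return T


def basic_sparse_correlation(T):
    """Basic approximation; needs at least 3 components and positive variances."""
    D = len(T)
    t = [sum(row) for row in T]              # t_i = sum_j t_ij (diagonal is 0)
    total = sum(t) / (2 * (D - 1))           # estimate of sum_i omega_i^2
    var = [(ti - total) / (D - 2) for ti in t]
    rho = [[1.0 if i == j else
            (var[i] + var[j] - T[i][j]) / (2 * math.sqrt(var[i] * var[j]))
            for j in range(D)] for i in range(D)]
    return var, rho


# Toy check: uncorrelated components with variances 1, 2, 3, 4.
true_var = [1.0, 2.0, 3.0, 4.0]
T = [[0.0 if i == j else true_var[i] + true_var[j] for j in range(4)] for i in range(4)]
var, rho = basic_sparse_correlation(T)
print(var)
```

In the toy check the log-ratio variances are built from known, uncorrelated variances, so the closed-form solution recovers them and all off-diagonal correlations come out as zero.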
16. Estimation of interaction in ecosystem
Q. Which microbes interact with each other in time series data?
Data: x1, x2, x3, … observed in samples #1 (t=1), #2 (t=2), #3 (t=3), …
• Equation-based approach: Brunton, et al., 2016, PNAS
• Equation-free approach: Deyle, et al., 2016, Proc. R. Soc. B; Suzuki, et al., 2017, Methods Ecol. Evol.
[Figure: network of microbes with unknown links ("Interaction?"); Suzuki, et al., 2017, Methods Ecol. Evol.]
17. VAR (vector autoregressive) model
Data
Model : the value at the next time step is determined by the current values, with a constant interaction coefficient from j to i
Fitting : for each i, minimize the residual sum of squares
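A first-order VAR model of this kind is commonly written as below. The symbols are assumptions, since the slide's own notation is not available here: x_i(t) is the population of microbe i at time t, a_ij the constant interaction from j to i, N the number of microbes, and T the number of time points. An intercept term is omitted for brevity.

```latex
x_i(t+1) = \sum_{j=1}^{N} a_{ij}\, x_j(t) + \varepsilon_i(t),
\qquad
\hat{a}_{i\cdot} = \operatorname*{arg\,min}_{a_{i\cdot}}
\sum_{t=1}^{T-1} \Bigl( x_i(t+1) - \sum_{j=1}^{N} a_{ij}\, x_j(t) \Bigr)^{2}
```

Each row of coefficients is an ordinary least-squares problem, solved independently for each i.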
18. S-map for estimation of interaction
Deyle, et al., 2016, Proc. R. Soc. B
Data
Model : the value at the next time step is determined by the current values, where the interaction from j to i is manifold dependent
Fitting : for each i, minimize the weighted residual sum of squares
Weight : for a given target point, each data point receives a weight defined by a parameter and a normalizing term
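A locally weighted fit of this kind can be sketched as follows. The weighting used here, w_k = exp(−θ d_k / d̄) with d_k the distance from data point k to the target point and d̄ the mean of those distances, is the form commonly given for S-map; treating it as the slide's formula is an assumption. The sketch applies w_k to the squared residual and omits an intercept, both simplifications, and all function names are hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def smap_coefficients(series, i, t_star, theta):
    """Local interaction coefficients a_i1..a_iN at the state series[t_star]."""
    X = series[:-1]                          # current states x(t)
    y = [s[i] for s in series[1:]]           # next values x_i(t+1)
    target = series[t_star]
    d = [math.dist(x, target) for x in X]
    d_mean = sum(d) / len(d)                 # normalizing term
    w = [math.exp(-theta * dk / d_mean) for dk in d]
    N = len(target)
    # Weighted normal equations: (sum w x x^T) a = sum w x y.
    A = [[sum(wk * x[r] * x[c] for wk, x in zip(w, X)) for c in range(N)]
         for r in range(N)]
    b = [sum(wk * x[r] * yk for wk, x, yk in zip(w, X, y)) for r in range(N)]
    return solve(A, b)
```

With θ = 0 all weights are equal and the fit reduces to the ordinary least-squares VAR fit; larger θ makes the coefficients depend more strongly on where the target point lies.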
19. Sparse S-map : extension of S-map to sparse interaction
Suzuki, et al., 2017, Methods Ecol. Evol.
Sparse S-map = S-map + Stepwise variable selection + Bagging
S-map : limited variable size
[Figure: stepwise selection over variables 1, 2, …, N. In each step (Step 1, Step 2, …), candidate variable sets are evaluated by the S-map estimation error with bagging, and a candidate is selected]
20. Overview of analysis methods for microbiota data & future prospects
• Measurement : relative population
• Genomic tree structure
• Sparse interaction network
• Dynamics in time (t)
• Other factors
Future prospects:
• Phenotypic dynamics?
• Dynamics in space (s)?
• Control?
Editor's Notes
Genomic (pronunciation note)
I will conclude my presentation with some future prospects
Ten to the power of 13, ten to the power of 3
They have been receiving more attention these days.
There is a certain environment. We want to know the composition of microbes in that environment. Each microbe has its genome, and in the genome there is a domain;
this is a gene for the ribosome.
We can get count data for each sequence
What species do these sequences correspond to? And
Hierarchical (pronunciation note)
One hundred thousand
O of n to the power of two,
Alignment is an algorithm that arranges sequences to identify similar regions.
How do the sequences correspond to each other?
How similar are they to each other?
A over L
Y is
Residual sum of squares
If the scalar field of the cost function is an ellipse like this,
I will show the path from LASSO to Sparse group subgroup lasso.
The candidate with the smallest estimation error is picked.