Data integration lab_meeting

Outline
● Background on Data Integration
○ Biological regulation
○ Omic data integration objectives
○ Data Integration Challenges
● Unsupervised methods and Application
○ Matrix factorization methods (iCluster+ )
○ Bayesian methods (BCC)
○ Network-based methods (SNF)
○ Multiple Kernel Learning and Multi-Step Analysis (rMKL-LPP)
2

Biological regulation
● Central dogma
3

Gene Regulatory Network
Regulatory elements
● Receptors
● Transcriptional factors
● Inhibitory factors
● Cis-trans element
Source: https://en.wikipedia.org/wiki/Gene_regulatory_network
4

Single omic study
● One-dimension data explains the
diagnostics and progression for
complex disorders
● Information is limited
● Different layers of biological
system are relevant and
dependent
6

Omic data integration objectives
● Promoting precise medicine from big data
● Multiview investigation on the
completeness and complexity of the
biological system
● Discover hidden biological regularities
● Make use of complementary information
and discover biomarkers for diagnosis,
progression and treatment in human
diseases
7

Data Integration Challenges (From Computational)
● Data integration is broad
● Data heterogeneity
● Data unification
● Data noise and bias
● Data integration and dimensionality reduction
8

Unsupervised classification
● Matrix factorization methods (iCluster and iCluster+ )
○ Assumption: common latent variable in different data
● Bayesian methods (Bayesian consensus clustering)
○ Assumption: assumptions on data distribution and data correlation
● Network-based methods (SNF)
○ Assumption: samples relationship can be enhanced from
complementary multiple omic data
● Multiple Kernel Learning and Multi-Step Analysis (rMKL-LPP)
○ Assumption: pattern in a lower dimensional and integrative
subspace
10

Data Integration for subtype discovery
● Data Source
○ Gene expression; DNA methylation; gene mutation
● Procedures
○ Data fusion -- Clustering -- Evaluation
● Biological interpretation
○ Molecular alterations
○ Survival outcome
○ Response to therapies
11

Procedure
● Data Fusion and K-means model selection
○ EM algorithm to obtain maximum
likelihood estimates
■ E-step provides a simultaneous
dimension reduction
■ M-step is to update the parameter
estimates
● Evaluation
○ Proportion of deviance -- POD (d/n^2)
○ Smaller, stronger cluster separability
○ Determine cluster number and lasso
parameter λ 15

Application on breast cancer
16

Summaries
● The joint latent variable model is completely scalable to include additional
data types
● iCluster have been applied to discover subtypes at breast cancer and
glioblastoma multiforme (GBM)
● iCluster+ makes different modeling assumptions on data types: binary,
continuous, categorical, and sequential data
17

Similarity Network Fusion (SNF)
18

SNF data fusion
1. Calculate sample similarity W in each omic dataset
using (1)
2. Calculate normalized weight matrix P from W using (2)
3. Use K nearest neighbors (KNN) to calculate local
affinity matrix S through the formulas (3) from W. P
carries the full information about the similarity of each
patient to all others whereas S only encodes the
similarity to the K most similar patients for each
patient.
4. Network fusion process: for 2 datasets, P1, S1 and P2,
S2 can be calculated, then iteratively update P1 and P2
for t steps using (4) and (5); for more than 2 datasets,
update the Ps using (5)
5. Obtain the overall fused matrix P by averaging the
updated single Ps
19

Spectral Clustering
Input X (n x n sample similarity matrix) and k clusters
Goal subgroups in a graph with disjoint cliques
Procedures:
1. Compute the normalized Laplacian L
2. Compute the first eigenvectors u and eigenvalues
for L
3. Let U be the matrix containing eigenvectors u as
columns
4. Form the matrix T from U by normalizing the rows
to norm 1
5. Cluster the points with k-means into clusters C1, ...,
Ck
20

Application: GBM subtype discovery
Evaluations:
1. P value in Cox log-rank test
2. Silhouette score
21

Summaries
● SNF can construct sample sample network by integrating multiple datasets
● SNF can be expanded to include more datasets and be applied in more
questions
22

Bayesian Consensus Clustering
● An integrative statistical model that permits a separate clustering of the
objects for each data source.
● These separate clusterings adhere loosely to an overall consensus clustering
● BCC do simultaneous estimation of both the consensus clustering and the
source-specific clusterings
23

Procedures
● Dirichlet mixture model to accommodate multiple data (X)
● Probability of belonging to one cluster
● Estimation
○ Gibbs sampling procedure to approximate the posterior distribution
○ Markov chain Monte Carlo (MCMC) proceeds by iteratively sampling
● Choose K based on highest mean adjusted adherence
24

Application on breast cancer
● RNA gene expression (GE) data
for 645 genes.
● DNA methylation (ME) data for
574 probes.
● miRNA expression (miRNA) data
for 423 miRNAs.
● Reverse phase protein array
(RPPA) data for 171 proteins.
25

Summaries
1. BCC model assumes a simple and general dependence between data
sources.
2. BCC models both an overall clustering and a clustering specific to each data
source, with advantages over traditional methods in terms of modeling
uncertainty and the ability to borrow information across sources.
3. BCC is suitable to work on multisource biomedical data, as well may be used
to compare clusterings from different statistical models for a single
homogeneous dataset.
27

Regularized Multiple Kernel Learning Locality
Preserving Projections (rMKL-LPP)
28
● It is an extension of the current multiple kernel learning with dimensional
reduction (MKL-DR) method, where the data are projected into a lower
dimensional and integrative subspace.
● A regularization term is added to avoid overfitting during the optimization
procedure, and it allows using several different kernel types.
● The Locality Preserving Projections (LPP) is applied to conserve the
sum of distances for each sample’s k-Nearest Neighbors.

Procedures
● Data fusion
○ rMKL-LPP
○ Optimization
○ integrated kernel matrix
● Clustering
○ K-means
○ Mean silhouette width used to optimize number of clusters
● Evaluation
○ Silhouette score and cross validation (Rand index)
29

Applications in 5 cancers
1. Comparison to state-of-the-art (SNF)
2. Robustness analysis
3. Comparison of clusterings to
established subtypes
4. Clinical implications from clusterings
30
5 cancers
1. glioblastoma multiforme (GBM) --
213 samples
2. breast invasive carcinoma (BIC) --
105 samples
3. kidney renal clear cell carcinoma
(KRCCC) -- 122 samples
4. lung squamous cell carcinoma
(LSCC) -- 106 samples
5. colon adenocarcinoma (COAD) -- 92
samplesDatasets: gene expression, DNA methylation
and miRNA expression data

1. Comparison to state-of-the-art
31

2. Robustness analysis
32
Fig. 2. Robustness of clustering for leave-one-out
datasets measured using Rand index.
Fig. 3. Robustness of clustering for leave-
one-out cross-validation applied to
reduced sized datasets measured using
Rand index.

3. Comparison of clusterings to established subtypes
33

4. Clinical implications from clusterings
34
GBM:
● 94 of 213 were
treated with
Temozolomide

Summaries
1. rMKL-LPP found subtypes with more interesting log-rank test compared to the
state-of-the-art method
2. Several kernel matrices per data type can improve performance burdance,
remove the burden of selecting the optimal kernel matrix and have fair
stability
3. rMKL-LPP compared to unregularized MKL-DR remains stable also for small
datasets
4. The application at GBM shows to capture this diverse information within one
clustering
36

References
1. Huang, S., Chaudhary, K. & Garmire, L. X. More Is Better: Recent Progress in Multi-Omics Data
Integration Methods. Front. Genet. 8, 84 (2017).
2. Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat.
Methods 11, 333–337 (2014).
3. Shen, R., Olshen, A. B. & Ladanyi, M. Integrative clustering of multiple genomic data types using a
joint latent variable model with application to breast and lung cancer subtype analysis.
Bioinformatics 25, 2906–2912 (2009).
4. Shen, R. et al. Integrative subtype discovery in glioblastoma using iCluster. PLoS One 7, e35236
(2012).
5. Mo, Q. et al. Pattern discovery and cancer gene identification in integrated cancer genomic data.
Proc. Natl. Acad. Sci. U. S. A. 110, 4245–4250 (2013).
6. Speicher, N. K. & Pfeifer, N. Integrating different data types by regularized unsupervised multiple
kernel learning with application to cancer subtype discovery. Bioinformatics 31, i268–75 (2015).
7. Lock, E. F. & Dunson, D. B. Bayesian consensus clustering. Bioinformatics 29, 2610–2616 (2013).
37

Data integration lab_meeting

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Data integration lab_meeting

Similar to Data integration lab_meeting (20)

More from Liangqun Lu

More from Liangqun Lu (13)

Recently uploaded

Recently uploaded (20)

Data integration lab_meeting

Editor's Notes