Liangqun Lu
2018 - 04 - 25
Outline
● Background on Data Integration
○ Biological regulation
○ Omic data integration objectives
○ Data Integration Challenges
● Unsupervised methods and Application
○ Matrix factorization methods (iCluster+ )
○ Bayesian methods (BCC)
○ Network-based methods (SNF)
○ Multiple Kernel Learning and Multi-Step Analysis (rMKL-LPP)
Biological regulation
● Central dogma
Gene Regulatory Network
Regulatory elements
● Receptors
● Transcription factors
● Inhibitory factors
● Cis- and trans-regulatory elements
Source: https://en.wikipedia.org/wiki/Gene_regulatory_network
Rich data
Single omic study
● One-dimensional (single-omic) data are used to
explain the diagnosis and progression of
complex disorders
● Information is limited
● Different layers of the biological
system are related and
interdependent
Omic data integration objectives
● Promote precision medicine using big data
● Investigate the completeness and complexity
of the biological system from multiple
views
● Discover hidden biological regularities
● Make use of complementary information
and discover biomarkers for diagnosis,
progression and treatment in human
diseases
Data Integration Challenges (Computational Perspective)
● Data integration is broad
● Data heterogeneity
● Data unification
● Data noise and bias
● Data integration and dimensionality reduction
Unsupervised classification
● Matrix factorization methods (iCluster and iCluster+)
○ Assumption: a common latent variable underlies the different data types
● Bayesian methods (Bayesian consensus clustering)
○ Assumption: explicit assumptions on data distribution and data correlation
● Network-based methods (SNF)
○ Assumption: sample relationships can be enhanced by
complementary multi-omic data
● Multiple Kernel Learning and Multi-Step Analysis (rMKL-LPP)
○ Assumption: patterns lie in a lower-dimensional, integrative
subspace
Data Integration for subtype discovery
● Data Source
○ Gene expression; DNA methylation; gene mutation
● Procedures
○ Data fusion -- Clustering -- Evaluation
● Biological interpretation
○ Molecular alterations
○ Survival outcome
○ Response to therapies
iCluster and iCluster+
Procedure
● Data Fusion and K-means model selection
○ EM algorithm to obtain maximum
likelihood estimates
■ E-step provides a simultaneous
dimension reduction
■ M-step is to update the parameter
estimates
● Evaluation
○ Proportion of deviance -- POD (d/n^2)
○ A smaller POD indicates stronger cluster separability
○ POD is used to determine the cluster number and the lasso
parameter λ
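The POD criterion can be sketched in a few lines. This is a minimal illustration, not the iCluster package code: it assumes the latent scores E[Z|X] are available as an n-by-(K-1) array, that the sample-by-sample product matrix is standardized to a unit diagonal with negative entries set to zero, and that d is the sum of absolute differences from a perfect block-diagonal matrix of 1s and 0s. The function name `proportion_of_deviance` is hypothetical.

```python
import numpy as np

def proportion_of_deviance(Z, labels):
    """POD = d / n^2 for latent scores Z (n samples x latent dims) and cluster labels."""
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    n = Z.shape[0]
    B = Z @ Z.T                          # n x n sample-by-sample product matrix
    scale = np.sqrt(np.diag(B))          # assumes no sample has an all-zero score
    B = B / np.outer(scale, scale)       # standardize so the diagonal equals 1
    B = np.clip(B, 0.0, None)            # assumption: negative entries are set to zero
    # "Perfect" pattern: 1 for pairs in the same cluster, 0 otherwise
    ideal = (labels[:, None] == labels[None, :]).astype(float)
    d = np.abs(B - ideal).sum()          # deviance from the ideal block structure
    return d / n**2
```

Under these assumptions POD lies between 0 and 1, so candidate values of the cluster number and λ can be compared by picking the combination with the smallest POD.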
Application on breast cancer
Summaries
● The joint latent variable model is fully scalable to include additional
data types
● iCluster has been applied to discover subtypes in breast cancer and
glioblastoma multiforme (GBM)
● iCluster+ makes different modeling assumptions for different data types: binary,
continuous, categorical, and count (sequencing) data
Similarity Network Fusion (SNF)
SNF data fusion
1. Calculate the sample similarity matrix W for each omic dataset
using (1)
2. Calculate the normalized weight matrix P from W using (2)
3. Use K nearest neighbors (KNN) to calculate the local
affinity matrix S from W using (3). P
carries the full information about the similarity of each
patient to all others, whereas S only encodes the
similarity to the K most similar patients for each
patient.
4. Network fusion process: for 2 datasets, calculate P1, S1 and P2,
S2, then iteratively update P1 and P2
for t steps using (4) and (5); for more than 2 datasets,
update the Ps using (5)
5. Obtain the overall fused matrix P by averaging the
updated single Ps
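The five steps can be sketched with NumPy as follows. This is an illustrative sketch rather than the SNFtool implementation: the function names, the defaults `k=3`, `mu=0.5` and `t=20`, the exclusion of each sample from its own neighbor set, and the renormalization of every P after each update are assumptions made here.

```python
import numpy as np

def affinity(X, k=3, mu=0.5):
    """Step 1: scaled exponential similarity W for one data type (samples in rows)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    d = np.sqrt(d2)
    # Mean distance of each sample to its k nearest neighbors (column 0 is the sample itself)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + d) / 3.0
    return np.exp(-d2 / (mu * eps + 1e-12))

def full_kernel(W):
    """Step 2: normalized weights P; off-diagonal entries of a row sum to 1/2, diagonal is 1/2."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k=3):
    """Step 3: local affinity S; keep the k most similar samples per row and renormalize."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)            # assumption: a sample is not its own neighbor
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W)
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf(Ws, k=3, t=20):
    """Steps 4-5: cross-diffuse the P matrices for t iterations, then average them."""
    m = len(Ws)                          # requires at least two data types
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_kernel(W, k) for W in Ws]
    for _ in range(t):
        updated = []
        for v in range(m):
            # Average status matrix of all other data types
            others = sum(Ps[u] for u in range(m) if u != v) / (m - 1)
            # Assumption: renormalize after every update, as in step 2
            updated.append(full_kernel(Ss[v] @ others @ Ss[v].T))
        Ps = updated
    fused = sum(Ps) / m
    return (fused + fused.T) / 2         # symmetrize the fused network
```

For two data types the inner loop reduces to updating P1 from S1 and P2, and P2 from S2 and P1. The fused matrix can then be passed to a clustering method that takes a similarity matrix as input.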
Spectral Clustering
Input: X (n x n sample similarity matrix) and the number of clusters k
Goal: find subgroups in a graph, ideally forming disjoint cliques
Procedures:
1. Compute the normalized Laplacian L
2. Compute the first k eigenvectors u1, ..., uk of L
(those with the smallest eigenvalues)
3. Let U be the matrix containing the eigenvectors u1, ..., uk as
columns
4. Form the matrix T from U by normalizing the rows
to norm 1
5. Cluster the rows of T with k-means into clusters C1, ...,
Ck
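The procedure above can be sketched with NumPy alone. This is a minimal sketch, assuming the symmetric normalized Laplacian L = I - D^(-1/2) X D^(-1/2); the small k-means with farthest-point initialization is a stand-in chosen here to keep the example deterministic, and all function names are hypothetical.

```python
import numpy as np

def farthest_point_init(T, k):
    """Deterministic seeding: start at row 0, then repeatedly add the farthest row."""
    centers = [T[0]]
    for _ in range(1, k):
        d = np.min([((T - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(T[d.argmax()])
    return np.array(centers)

def kmeans(T, k, n_iter=100):
    """Plain Lloyd iterations on the rows of T."""
    centers = farthest_point_init(T, k)
    for _ in range(n_iter):
        d = ((T[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([T[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                        for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def spectral_clustering(X, k):
    """Cluster samples from an n x n symmetric similarity matrix X."""
    deg = X.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Step 1: normalized Laplacian
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * X * d_inv_sqrt[None, :]
    # Steps 2-3: eigh returns eigenvalues in ascending order, so take the first k columns
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    # Step 4: normalize each row to unit length
    T = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Step 5: k-means on the rows of T
    return kmeans(T, k)
```

In practice a library routine with several random restarts would usually replace the toy k-means used here.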
Application: GBM subtype discovery
Evaluations:
1. P value in Cox log-rank test
2. Silhouette score
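As an illustration of the second criterion, the mean silhouette score can be computed directly from pairwise distances. This is a sketch assuming Euclidean distance on a feature matrix, at least two clusters, and a score of 0 for singleton clusters; the function name `mean_silhouette` is hypothetical.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all samples."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the sample itself
        if not same.any():
            scores.append(0.0)               # assumed convention for singleton clusters
            continue
        a = D[i, same].mean()                # mean distance within the own cluster
        b = min(D[i, labels == c].mean()     # mean distance to the nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Values close to 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping clusters.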
Summaries
● SNF can construct a sample-by-sample network by integrating multiple datasets
● SNF can be expanded to include more datasets and be applied to more
questions
Bayesian Consensus Clustering
● An integrative statistical model that permits a separate clustering of the
objects for each data source.
● These separate clusterings adhere loosely to an overall consensus clustering
● BCC simultaneously estimates both the consensus clustering and the
source-specific clusterings
Procedures
● Dirichlet mixture model extended to accommodate multiple data sources (X)
● Each sample has a probability of belonging to each cluster
● Estimation
○ Gibbs sampling procedure to approximate the posterior distribution
○ Markov chain Monte Carlo (MCMC) proceeds by iteratively sampling
● Choose K based on highest mean adjusted adherence
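One piece of the Gibbs sampler can be sketched as follows: the conditional distribution of a sample's source-specific label given its consensus label. This is a sketch under assumptions, not the authors' code. It assumes that a source-specific label equals the consensus label with probability alpha (the adherence) and is spread evenly over the other K - 1 clusters otherwise, and that the adjusted adherence is (K * alpha - 1) / (K - 1). The function names and the log-likelihood input are hypothetical.

```python
import numpy as np

def nu(k, c, alpha, K):
    """P(source-specific label = k | consensus label = c) under adherence alpha."""
    return alpha if k == c else (1.0 - alpha) / (K - 1)

def source_label_posterior(loglik, c, alpha):
    """Conditional distribution of one sample's label in one data source.

    loglik[k] is the log-likelihood of the sample under cluster k of that source
    (hypothetical input; it depends on the chosen mixture components).
    Assumes 1/K <= alpha < 1 so that every prior weight is positive.
    """
    K = len(loglik)
    log_prior = np.log([nu(k, c, alpha, K) for k in range(K)])
    logp = np.asarray(loglik, dtype=float) + log_prior
    logp -= logp.max()                   # stabilize before exponentiating
    p = np.exp(logp)
    return p / p.sum()

def adjusted_adherence(alpha, K):
    """Rescale alpha from [1/K, 1] to [0, 1] so that values are comparable across K."""
    return (K * alpha - 1.0) / (K - 1.0)
```

Within one Gibbs iteration, a new label would be drawn with `np.random.default_rng().choice(K, p=source_label_posterior(...))`. Averaging `adjusted_adherence` over the data sources gives a quantity that can be compared across candidate values of K.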
Application on breast cancer
● RNA gene expression (GE) data
for 645 genes.
● DNA methylation (ME) data for
574 probes.
● miRNA expression (miRNA) data
for 423 miRNAs.
● Reverse phase protein array
(RPPA) data for 171 proteins.
Summaries
1. BCC model assumes a simple and general dependence between data
sources.
2. BCC models both an overall clustering and a clustering specific to each data
source, with advantages over traditional methods in terms of modeling
uncertainty and the ability to borrow information across sources.
3. BCC is suitable for multisource biomedical data and may also be used
to compare clusterings from different statistical models for a single
homogeneous dataset.
Regularized Multiple Kernel Learning Locality
Preserving Projections (rMKL-LPP)
● It is an extension of the existing multiple kernel learning with dimensionality
reduction (MKL-DR) method, in which the data are projected into a
lower-dimensional, integrative subspace.
● A regularization term is added to avoid overfitting during the optimization
procedure, and the method allows several different kernel types to be used.
● Locality Preserving Projections (LPP) are applied to conserve the
sum of distances for each sample's k nearest neighbors.
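The idea of an integrated kernel built from several kernels can be illustrated as a weighted sum of kernel matrices. This sketch only shows the combination step: in rMKL-LPP the weights are learned by optimization, whereas here they are fixed, hypothetical values that are required to be non-negative and are rescaled to sum to 1 (an assumed form of the regularization constraint). The Gaussian kernel and the function names are also assumptions.

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for samples in the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def ensemble_kernel(kernels, beta):
    """Weighted sum of kernel matrices with non-negative weights that sum to 1."""
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0):
        raise ValueError("kernel weights must be non-negative")
    beta = beta / beta.sum()             # assumed constraint: weights sum to 1
    return sum(b * K for b, K in zip(beta, kernels))
```

For example, several Gaussian kernels with different bandwidths can be computed per data type and combined, so that no single bandwidth has to be chosen in advance.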
Procedures
● Data fusion
○ rMKL-LPP
○ Optimization
○ integrated kernel matrix
● Clustering
○ K-means
○ Mean silhouette width used to optimize number of clusters
● Evaluation
○ Silhouette score and cross validation (Rand index)
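The Rand index used in the cross-validation step measures how often two clusterings agree on whether a pair of samples belongs together. A minimal sketch (the function name is hypothetical, and this is the unadjusted Rand index):

```python
from itertools import combinations

def rand_index(a, b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two samples
    in the same cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)
```

For a robustness check, the clustering obtained on the full dataset can be compared with the clustering obtained after leaving samples out, restricted to the samples present in both.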
Applications in 5 cancers
1. Comparison to state-of-the-art (SNF)
2. Robustness analysis
3. Comparison of clusterings to
established subtypes
4. Clinical implications from clusterings
5 cancers
1. glioblastoma multiforme (GBM) --
213 samples
2. breast invasive carcinoma (BIC) --
105 samples
3. kidney renal clear cell carcinoma
(KRCCC) -- 122 samples
4. lung squamous cell carcinoma
(LSCC) -- 106 samples
5. colon adenocarcinoma (COAD) -- 92
samples
Datasets: gene expression, DNA methylation
and miRNA expression data
1. Comparison to state-of-the-art
2. Robustness analysis
Fig. 2. Robustness of clustering for leave-one-out
datasets measured using Rand index.
Fig. 3. Robustness of clustering for leave-
one-out cross-validation applied to
reduced sized datasets measured using
Rand index.
3. Comparison of clusterings to established subtypes
4. Clinical implications from clusterings
GBM:
● 94 of 213 patients were treated with Temozolomide
Explain better survival
Summaries
1. rMKL-LPP found subtypes whose log-rank test results compare favorably with the
state-of-the-art method
2. Using several kernel matrices per data type can improve performance,
removes the burden of selecting the optimal kernel matrix, and gives fairly
stable results
3. Compared to unregularized MKL-DR, rMKL-LPP remains stable even for small
datasets
4. The application to GBM shows that the method can capture diverse information within one
clustering
References
1. Huang, S., Chaudhary, K. & Garmire, L. X. More Is Better: Recent Progress in Multi-Omics Data
Integration Methods. Front. Genet. 8, 84 (2017).
2. Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat.
Methods 11, 333–337 (2014).
3. Shen, R., Olshen, A. B. & Ladanyi, M. Integrative clustering of multiple genomic data types using a
joint latent variable model with application to breast and lung cancer subtype analysis.
Bioinformatics 25, 2906–2912 (2009).
4. Shen, R. et al. Integrative subtype discovery in glioblastoma using iCluster. PLoS One 7, e35236
(2012).
5. Mo, Q. et al. Pattern discovery and cancer gene identification in integrated cancer genomic data.
Proc. Natl. Acad. Sci. U. S. A. 110, 4245–4250 (2013).
6. Speicher, N. K. & Pfeifer, N. Integrating different data types by regularized unsupervised multiple
kernel learning with application to cancer subtype discovery. Bioinformatics 31, i268–75 (2015).
7. Lock, E. F. & Dunson, D. B. Bayesian consensus clustering. Bioinformatics 29, 2610–2616 (2013).

Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Gas_Laws_powerpoint_notes.ppt for grade 10
Gas_Laws_powerpoint_notes.ppt for grade 10Gas_Laws_powerpoint_notes.ppt for grade 10
Gas_Laws_powerpoint_notes.ppt for grade 10
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Solution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutionsSolution chemistry, Moral and Normal solutions
Solution chemistry, Moral and Normal solutions
 
Cytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxCytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptx
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 

Data integration lab_meeting

  • 2. Outline
    ● Background on Data Integration
      ○ Biological regulation
      ○ Omic data integration objectives
      ○ Data Integration Challenges
    ● Unsupervised methods and Application
      ○ Matrix factorization methods (iCluster+)
      ○ Bayesian methods (BCC)
      ○ Network-based methods (SNF)
      ○ Multiple Kernel Learning and Multi-Step Analysis (rMKL-LPP)
  • 4. Gene Regulatory Network
    Regulatory elements
    ● Receptors
    ● Transcription factors
    ● Inhibitory factors
    ● Cis-trans elements
    Source: https://en.wikipedia.org/wiki/Gene_regulatory_network
  • 6. Single omic study
    ● One-dimensional (single-omic) data are used to explain the diagnosis and progression of complex disorders
    ● Information is limited
    ● Different layers of the biological system are relevant and interdependent
  • 7. Omic data integration objectives
    ● Promoting precision medicine from big data
    ● Multiview investigation of the completeness and complexity of the biological system
    ● Discover hidden biological regularities
    ● Make use of complementary information and discover biomarkers for diagnosis, progression and treatment of human diseases
  • 8. Data Integration Challenges (computational perspective)
    ● Data integration is broad
    ● Data heterogeneity
    ● Data unification
    ● Data noise and bias
    ● Data integration and dimensionality reduction
  • 10. Unsupervised classification
    ● Matrix factorization methods (iCluster and iCluster+)
      ○ Assumption: a common latent variable is shared across the different data types
    ● Bayesian methods (Bayesian consensus clustering)
      ○ Assumption: assumptions on data distribution and data correlation
    ● Network-based methods (SNF)
      ○ Assumption: sample relationships can be enhanced by complementary multiple omic data
    ● Multiple Kernel Learning and Multi-Step Analysis (rMKL-LPP)
      ○ Assumption: patterns lie in a lower-dimensional, integrative subspace
  • 11. Data Integration for subtype discovery
    ● Data Source
      ○ Gene expression; DNA methylation; gene mutation
    ● Procedures
      ○ Data fusion -- Clustering -- Evaluation
    ● Biological interpretation
      ○ Molecular alterations
      ○ Survival outcome
      ○ Response to therapies
  • 14. Procedure
    ● Data fusion and K-means model selection
      ○ EM algorithm to obtain maximum likelihood estimates
        ■ The E-step provides a simultaneous dimension reduction
        ■ The M-step updates the parameter estimates
    ● Evaluation
      ○ Proportion of deviance, POD (d/n^2)
      ○ A smaller POD indicates stronger cluster separability
      ○ POD is used to determine the cluster number and the lasso parameter λ
  • 16. Summaries
    ● The joint latent variable model is completely scalable to include additional data types
    ● iCluster has been applied to discover subtypes in breast cancer and glioblastoma multiforme (GBM)
    ● iCluster+ makes different modeling assumptions for different data types: binary, continuous, categorical, and count (sequencing) data
  • 18. SNF data fusion
    1. Calculate the sample similarity matrix W in each omic dataset using (1)
    2. Calculate the normalized weight matrix P from W using (2)
    3. Use K nearest neighbors (KNN) to calculate the local affinity matrix S from W using (3). P carries the full information about the similarity of each patient to all others, whereas S only encodes the similarity to the K most similar patients for each patient.
    4. Network fusion process: for 2 datasets, calculate P1, S1 and P2, S2, then iteratively update P1 and P2 for t steps using (4) and (5); for more than 2 datasets, update the Ps using (5)
    5. Obtain the overall fused matrix P by averaging the updated single Ps
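
The five SNF steps can be sketched in NumPy as below. This is a simplified illustration, not the reference implementation: the formulas (1) to (5) are not reproduced in this transcript, so the sketch assumes a plain Gaussian kernel with a single global bandwidth `sigma` for step 1, and it re-normalizes the rows of each P after every update as a numerical-stability choice. The function names `affinity`, `full_kernel`, `local_kernel` and `snf` are hypothetical.

```python
import numpy as np

def affinity(X, sigma=1.0):
    """Step 1: sample-by-sample similarity W from an (n_samples, n_features) matrix.
    Assumption: plain Gaussian kernel on squared Euclidean distance."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def full_kernel(W):
    """Step 2: normalized weight matrix P (diagonal 1/2, off-diagonal row mass 1/2)."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """Step 3: local affinity S, keeping only the k most similar samples per row."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    S = np.zeros_like(W)
    idx = np.argsort(-W, axis=1)[:, :k]          # k nearest neighbours of each sample
    rows = np.arange(W.shape[0])[:, None]
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf(Ws, k=3, t=20):
    """Steps 4 and 5: cross-diffuse the views for t steps, then average them."""
    m = len(Ws)                                   # requires at least two views
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_kernel(W, k) for W in Ws]
    for _ in range(t):
        updated = []
        for v in range(m):
            # Average status matrix of all other views, diffused through S of view v.
            others = sum(Ps[u] for u in range(m) if u != v) / (m - 1)
            P_new = Ss[v] @ others @ Ss[v].T
            # Row re-normalization: a stabilizing assumption of this sketch.
            updated.append(P_new / P_new.sum(axis=1, keepdims=True))
        Ps = updated
    fused = sum(Ps) / m
    return (fused + fused.T) / 2.0                # symmetrize the fused network
```

With two views measured on the same samples, `snf([affinity(X1), affinity(X2)], k=2)` returns a single n x n fused similarity matrix that can be passed to a graph clustering method.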
  • 19. Spectral Clustering
    Input: X (n x n sample similarity matrix) and the number of clusters k
    Goal: find subgroups in a graph with disjoint cliques
    Procedures:
    1. Compute the normalized Laplacian L
    2. Compute the first k eigenvectors u_1, ..., u_k of L and their eigenvalues
    3. Let U be the matrix containing the eigenvectors u_1, ..., u_k as columns
    4. Form the matrix T from U by normalizing the rows to norm 1
    5. Cluster the rows of T with k-means into clusters C1, ..., Ck
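
A minimal NumPy sketch of this procedure follows. It assumes the symmetric normalized Laplacian L = I - D^(-1/2) W D^(-1/2) and takes the "first" eigenvectors to be those with the k smallest eigenvalues. The small k-means helper uses a deterministic farthest-point initialization, which is a choice made here for reproducibility, not something stated on the slide. Both function names are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Small Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(W, k):
    """Cluster samples from a symmetric, non-negative n x n similarity matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Step 1: symmetric normalized Laplacian.
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Steps 2-3: eigh returns eigenvalues in ascending order; keep the first k vectors.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    # Step 4: normalize each row of U to unit length.
    T = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Step 5: k-means on the rows of T.
    return kmeans(T, k)
```

The returned array holds one cluster label per sample. The label values themselves are arbitrary, so only the grouping is meaningful.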
  • 20. Application: GBM subtype discovery
    Evaluations:
    1. P value in Cox log-rank test
    2. Silhouette score
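
The silhouette score used as the second evaluation can be computed directly from a pairwise distance matrix. The sketch below assumes distances rather than similarities, so a fused similarity matrix would first need to be converted (that conversion is not specified here). It also assumes at least two clusters, and the function name `mean_silhouette` is hypothetical.

```python
import numpy as np

def mean_silhouette(D, labels):
    """Mean silhouette width from an n x n distance matrix D and cluster labels."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False                      # exclude the sample itself
        if not same.any():                   # singleton cluster: score stays 0
            continue
        a = D[i, same].mean()                # mean distance within own cluster
        b = min(D[i, labels == c].mean()     # mean distance to the nearest other cluster
                for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned samples.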
  • 21. Summaries
    ● SNF can construct a sample-by-sample network by integrating multiple datasets
    ● SNF can be expanded to include more datasets and applied to other questions
  • 22. Bayesian Consensus Clustering
    ● An integrative statistical model that permits a separate clustering of the objects for each data source.
    ● These separate clusterings adhere loosely to an overall consensus clustering.
    ● BCC simultaneously estimates both the consensus clustering and the source-specific clusterings.
  • 23. Procedures
    ● Dirichlet mixture model to accommodate multiple data sources (X)
    ● Probability of belonging to one cluster
    ● Estimation
      ○ Gibbs sampling procedure to approximate the posterior distribution
      ○ Markov chain Monte Carlo (MCMC) proceeds by iteratively sampling
    ● Choose the K with the highest mean adjusted adherence
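
The link between source-specific and consensus clusterings, and the adherence criterion used to choose K, can be written out as follows. This is a sketch that assumes the parameterization of Lock and Dunson (reference 7): L_{mn} is the cluster of object n in source m, C_n is its consensus cluster, and α_m is the adherence of source m. The symbols are taken from that paper rather than from the slide itself.

```latex
% Probability that object n falls in cluster k of source m, given its consensus cluster C_n
P(L_{mn} = k \mid C_n) =
\begin{cases}
  \alpha_m, & \text{if } C_n = k, \\
  \dfrac{1 - \alpha_m}{K - 1}, & \text{otherwise},
\end{cases}
\qquad \alpha_m \in \left[\tfrac{1}{K},\, 1\right]

% Adjusted adherence rescales \alpha_m to [0, 1]; K is chosen to maximize its mean over M sources
\alpha_m^{*} = \frac{K\alpha_m - 1}{K - 1},
\qquad
\bar{\alpha}^{*} = \frac{1}{M} \sum_{m=1}^{M} \alpha_m^{*}
```

Under this reading, α_m = 1 forces source m to match the consensus clustering exactly, while α_m = 1/K makes the source clustering independent of it.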
  • 24. Application on breast cancer
    ● RNA gene expression (GE) data for 645 genes.
    ● DNA methylation (ME) data for 574 probes.
    ● miRNA expression (miRNA) data for 423 miRNAs.
    ● Reverse phase protein array (RPPA) data for 171 proteins.
  • 26. Summaries
    1. The BCC model assumes a simple and general dependence between data sources.
    2. BCC models both an overall clustering and a clustering specific to each data source, with advantages over traditional methods in terms of modeling uncertainty and the ability to borrow information across sources.
    3. BCC is suitable for multisource biomedical data, and may also be used to compare clusterings from different statistical models for a single homogeneous dataset.
  • 27. Regularized Multiple Kernel Learning Locality Preserving Projections (rMKL-LPP)
    ● It is an extension of the existing multiple kernel learning with dimensionality reduction (MKL-DR) method, in which the data are projected into a lower-dimensional, integrative subspace.
    ● A regularization term is added to avoid overfitting during the optimization procedure, which allows the use of several different kernel types.
    ● Locality Preserving Projections (LPP) are applied to conserve the sum of distances to each sample's k nearest neighbors.
  • 28. Procedures
    ● Data fusion
      ○ rMKL-LPP
      ○ Optimization
      ○ Integrated kernel matrix
    ● Clustering
      ○ K-means
      ○ Mean silhouette width used to optimize the number of clusters
    ● Evaluation
      ○ Silhouette score and cross-validation (Rand index)
  • 29. Applications in 5 cancers
    1. Comparison to state-of-the-art (SNF)
    2. Robustness analysis
    3. Comparison of clusterings to established subtypes
    4. Clinical implications from clusterings
    5 cancers:
    1. glioblastoma multiforme (GBM) -- 213 samples
    2. breast invasive carcinoma (BIC) -- 105 samples
    3. kidney renal clear cell carcinoma (KRCCC) -- 122 samples
    4. lung squamous cell carcinoma (LSCC) -- 106 samples
    5. colon adenocarcinoma (COAD) -- 92 samples
    Datasets: gene expression, DNA methylation and miRNA expression data
  • 30. 1. Comparison to state-of-the-art
  • 31. 2. Robustness analysis
    Fig. 2. Robustness of clustering for leave-one-out datasets measured using Rand index.
    Fig. 3. Robustness of clustering for leave-one-out cross-validation applied to reduced-size datasets measured using Rand index.
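
The Rand index used in both robustness figures measures how often two clusterings agree on whether a pair of samples belongs together. A minimal pure-Python sketch is given below. It is the unadjusted Rand index, and the function name `rand_index` is hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two samples
    in the same cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

In the leave-one-out setting, `labels_a` would be the clustering of the whole dataset and `labels_b` the assignments obtained when each patient is left out and then added back. A value of 1 means the two clusterings are identical up to a renaming of the clusters.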
  • 32. 3. Comparison of clusterings to established subtypes
  • 33. 4. Clinical implications from clusterings
    GBM:
    ● 94 of 213 patients were treated with temozolomide
  • 35. Summaries
    1. rMKL-LPP found subtypes with more significant log-rank test results than the state-of-the-art method
    2. Using several kernel matrices per data type can improve performance, removes the burden of selecting the optimal kernel matrix, and shows fair stability
    3. Compared to unregularized MKL-DR, rMKL-LPP remains stable even for small datasets
    4. The application to GBM shows that the method captures diverse information within one clustering
  • 36. References
    1. Huang, S., Chaudhary, K. & Garmire, L. X. More Is Better: Recent Progress in Multi-Omics Data Integration Methods. Front. Genet. 8, 84 (2017).
    2. Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333–337 (2014).
    3. Shen, R., Olshen, A. B. & Ladanyi, M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25, 2906–2912 (2009).
    4. Shen, R. et al. Integrative subtype discovery in glioblastoma using iCluster. PLoS One 7, e35236 (2012).
    5. Mo, Q. et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc. Natl. Acad. Sci. U. S. A. 110, 4245–4250 (2013).
    6. Speicher, N. K. & Pfeifer, N. Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery. Bioinformatics 31, i268–75 (2015).
    7. Lock, E. F. & Dunson, D. B. Bayesian consensus clustering. Bioinformatics 29, 2610–2616 (2013).

Editor's Notes

  1. The main advantage of Bayesian methods in data integration is that they can make assumptions not only on different types of data sets with various distributions but also on the correlations among data sets.
  2. Model selection involves estimating the number of clusters K and the lasso parameter λ.
  3. (C) Model selection based on POD measure. A four-cluster sparse solution (λ = 0.2) was chosen.
  4. Spectral clustering is suitable for graph clustering
  5. It is an extension of the current multiple kernel learning with dimensional reduction (MKL-DR) method. MKL-DR: https://pdfs.semanticscholar.org/1cd3/bbae54b217843870fdc771d727b6043225b8.pdf
  6. Fig. 2. Robustness of clustering for leave-one-out datasets measured using Rand index. Each patient is left out once in the dimensionality reduction and clustering procedure and afterwards added to the cluster with the closest mean based on the learned projection for this data point, which is given by $\mathrm{proj}(x_i) = A^{T} K^{(i)} \beta$. The resulting cluster assignment is then compared with the clustering of the whole dataset. The error bars represent one standard deviation.
     Fig. 3. Robustness of clustering for leave-one-out cross-validation applied to reduced-size datasets measured using Rand index. For each cancer type, we sampled 20 times half of the patients and applied leave-one-out cross-validation as described in Section 3.4. The error bars represent one standard deviation.
  7. The results are very similar to those found by Noushmehr et al. (2010) for their identified G-CIMP positive subtype. In addition, we found the set of underexpressed genes to be highly enriched for processes associated to the immune system and inflammation [cf. Table 3 (column 2)]. Since chronic inflammation is generally related to cancer progression and is thought to play an important role in the construction of the tumor microenvironment (Hanahan and Weinberg, 2011), these downregulations might be a reason for the favorable outcome of patients from this cluster.