Clustering/Community Detection in
Large Protein-Protein Interaction
Networks
Marco Pellegrini
IIT-CNR
CDAC2018, May 22-25, 2018
Why clustering PPIN
• PPIN give a (static) system-level overview of
proteomic machinery.
• PPIN have a natural modularity structure
• Functional modules and protein complexes
appear as dense sub-networks (in an absolute or
relative sense).
• PPIN topology hints at the role of
proteins and protein interactions (e.g. hubs).
• Diseases produce network perturbations.
Excursus on PPIN clustering algorithms
• Methods from Computer Science/OR/Graph
Theory/Social Network Analysis (e.g. MCL, RNSC, CFinder).
• Methods specifically designed for PPI and based on topology
only (e.g. MCODE, ClusterONE, …).
• Methods designed for PPI, based on an abstract model of a
complex, e.g. core-attachment (e.g. COACH, CORE, …).
• Methods incorporating additional biological annotations (e.g.
functional coherence, evolutionary conservation, co-location,
PPI 3D interface constraints, dynamics of PPI, gene
co-expression, …) (e.g. DECAFF, MATISSE, …).
Timeline
• 2000/2002: MCL
• 2003: MCODE
• 2004: RNSC
• 2005: LCMA
• 2006: CFinder, CDPlus
• 2007: PCP, DECAFF
• 2009: COACH, CMC, HACO, CORE, RRW
• 2010: CFA, MCL-Caw, SPICi
• 2011: HC-PIN
• 2012: ClusterONE, ProRank, Weak-Ties, CMBI
• 2013: WPNCA, ppsampler2, C-BNMF, PCIA
• 2014: RFC, C-element, …
• That makes 27 methods, and counting.
Large PPI networks and Protein Complexes
• 5 PPI networks: DIP-yeast, Biogrid-yeast, String-yeast,
Biogrid-HS, String-HS.
• 2 complex databases: CYC2008-yeast, Corum-HS.
• Basic measurements to validate the following
working hypotheses for large PPIN:
• Protein complexes (PC) are ego-networks
• PC are highly dense
• PC may be highly overlapping
Some PPI Data sets - statistics
Database  Species   Proteins  Interactions  Avg. degree
DIP       yeast        4,637        21,107          9.1
Biogrid   yeast        6,686       220,499         65.9
String    yeast        5,590       133,082         47.6
Biogrid   Homo S.     18,170       137,775         15.1
String    Homo S.     12,717       193,150         30.3
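The average-degree column can be cross-checked from the other two columns, since each interaction contributes to the degree of two proteins. The short sketch below recomputes it as 2·interactions/proteins; the dictionary layout and function name are illustrative only. The recomputed values agree with the table to within one unit of the last printed digit (the table appears to truncate rather than round).

```python
# Protein and interaction counts copied from the table above.
networks = {
    ("DIP", "yeast"): (4637, 21107),
    ("Biogrid", "yeast"): (6686, 220499),
    ("String", "yeast"): (5590, 133082),
    ("Biogrid", "Homo S."): (18170, 137775),
    ("String", "Homo S."): (12717, 193150),
}


def average_degree(num_proteins, num_interactions):
    """Average degree of an undirected graph: every edge touches two nodes."""
    return 2 * num_interactions / num_proteins


avg = {name: average_degree(v, e) for name, (v, e) in networks.items()}

for (database, species), value in avg.items():
    print(f"{database:8s} {species:8s} {value:6.2f}")
```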
Size of PPI networks used
[Figure: bar chart of the number of proteins and interactions (y-axis 0 to 250,000) for DIP-yeast, Biogrid-yeast, String-yeast, Biogrid-HS and String-HS.]
Complexes data sets
Database  Species       Num. complexes  Size > 2 (*)  Size ≤ 2  Num. proteins
CORUM     Homo sapiens            1750          1257       493           2506
CYC2008   yeast                    408           236       172           1627
(*) Used in the evaluation
Are Complexes Dense?
Data set      # CX  Min size  Max size  Mean size  Δ > 0.9    Δ > 0.5
DIP-CYC2008    236         3        40       6.02   60 (25%)  131 (55%)
BG-CYC2008     236         3        81       6.67  173 (73%)  223 (94%)
STR-CYC2008    236         3        81       6.67  220 (93%)  235 (99%)
BG-CORUM      1257         3       143       6.12  516 (41%)  943 (75%)
STR-CORUM     1188         3       133       6.07  621 (52%)  981 (82%)
Are Complexes Ego-nets?
Data set      # CX  R1 > 0.9    R1 > 0.5
DIP-CYC2008    236  131 (55%)   197 (83%)
BG-CYC2008     236  216 (91%)   234 (99%)
STR-CYC2008    236  235 (99%)   236 (100%)
BG-CORUM      1257  891 (70%)  1162 (92%)
STR-CORUM     1188  923 (77%)  1139 (95%)
Overlapping structure of complexes
Data set      # P   In 1 CX      In 2 CX    In 3 CX    In > 3 CX
DIP-CYC2008   1175  1005 (86%)   134 (11%)   23 (2%)    13 (1%)
BG-CYC2008    1342  1166 (87%)   139 (10%)   24 (2%)    13 (1%)
STR-CYC2008   1341  1165 (87%)   139 (10%)   24 (2%)    13 (1%)
BG-CORUM      2227   909 (41%)   483 (21%)  233 (11%)  602 (27%)
STR-CORUM     2067   852 (42%)   430 (20%)  217 (10%)  568 (28%)
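The overlap statistics above count, for each protein, how many complexes it belongs to. A minimal sketch of that tally follows; the toy complexes and the function name are made up for illustration and are not taken from CYC2008 or CORUM.

```python
from collections import Counter


def membership_histogram(complexes):
    """Count proteins by number of complexes they belong to.

    Keys 1, 2, 3 mean "in exactly that many complexes";
    key 4 collects proteins that are in more than 3 complexes.
    """
    per_protein = Counter(p for cx in complexes for p in set(cx))
    hist = Counter()
    for n in per_protein.values():
        hist[min(n, 4)] += 1
    return hist


# Toy input (hypothetical protein names).
toy_complexes = [
    {"A", "B", "C"},
    {"C", "D"},
    {"A", "C", "E"},
    {"C", "F"},
]
toy_hist = membership_histogram(toy_complexes)
print(dict(toy_hist))
```

In this toy input protein C sits in four complexes and protein A in two, which is the kind of heavy overlap the CORUM rows show for a large share of proteins.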
Core & Peel
Protein complex prediction for large protein-protein interaction networks with the Core&Peel
method. M Pellegrini, M Baglioni, F Geraci. BMC Bioinformatics 17 (Suppl 12), 37–58 (2016)
Some Definitions of “density”
• G=(V,E), V= Vertex set, E= Edge set. d(x)= degree of x.
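• Average degree: $av(G) = \frac{2|E|}{|V|}$
• Density: $\Delta(G) = \frac{2|E|}{|V|(|V|-1)}$
• Conductance: $\phi(G) = \min_{S \subset V} \frac{|E(S, V \setminus S)|}{\min(\delta(S), \delta(V \setminus S))}$, where $\delta(S) = \sum_{x \in S} d(x)$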
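To make the three notions concrete, here is a small sketch that evaluates them on a toy graph (two triangles joined by one edge). The adjacency-dictionary representation and the function names are assumptions of this example. The conductance function scores one given cut only; minimizing over all cuts, as in the definition, is not attempted here.

```python
def num_edges(adj):
    """Number of undirected edges in an adjacency-set dictionary."""
    return sum(len(neigh) for neigh in adj.values()) // 2


def average_degree(adj):
    return 2 * num_edges(adj) / len(adj)


def density(adj):
    n = len(adj)
    return 2 * num_edges(adj) / (n * (n - 1))


def cut_conductance(adj, side):
    """Conductance of the single cut (side, V minus side)."""
    side = set(side)
    crossing = sum(1 for u in side for w in adj[u] if w not in side)
    vol_side = sum(len(adj[u]) for u in side)
    vol_rest = sum(len(adj[u]) for u in adj if u not in side)
    return crossing / min(vol_side, vol_rest)


# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
toy = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(average_degree(toy), density(toy), cut_conductance(toy, {0, 1, 2}))
```

The whole toy graph is sparse (density below 0.5), while each triangle taken alone has density 1; the single bridging edge gives the cut between the triangles a low conductance.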
Algorithmic ideas behind Core&Peel
• Consider the following planted clique detection
problem (Kucera ‘95, Dekel et al. ‘10).
• Take a random graph G=G(n,p). Embed a clique K_k in G
to obtain G’(n,p,k).
• If k is of the same order of magnitude as the max degree
in G, then w.h.p. the following simple algorithm works well:
(1) Sort the nodes of G’ by decreasing degree.
(2) Take the top c nodes v1,..,vc for c=3,4,5,…
(3) Verify whether v1,..,vc induce a complete graph.
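The three steps above can be tried out on a synthetic instance. The sketch below is an illustration under assumed parameters (n=200, p=0.1, k=60, chosen so that the planted clique is far larger than any background degree); the function names are hypothetical and the code is not taken from the cited papers.

```python
import random
from itertools import combinations


def planted_clique_graph(n, p, k, seed=0):
    """G(n, p) with a clique planted on k randomly chosen nodes."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    clique = rng.sample(range(n), k)
    for u, v in combinations(clique, 2):
        adj[u].add(v)
        adj[v].add(u)
    return adj, set(clique)


def is_clique(adj, nodes):
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def top_degree_clique(adj, c_min=3):
    """Largest prefix of the degree-sorted node list that induces a clique."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)  # step 1
    best = []
    for c in range(c_min, len(order) + 1):                        # step 2
        if not is_clique(adj, order[:c]):                         # step 3
            break
        best = order[:c]
    return set(best)


adj, planted = planted_clique_graph(n=200, p=0.1, k=60, seed=1)
recovered = top_degree_clique(adj)
print(len(recovered), recovered == planted)
```

With these parameters every clique node has degree at least 59, while background degrees concentrate around 20, so sorting by degree pushes the planted nodes to the top of the list.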
Why does this work?
• The planted clique (dense sub-graph) of a certain
size induces a bias in an easy “measure” so that
the clique nodes emerge as higher in rank than
random non-clique nodes (background).
• Thus sorting by this measure highlights the
possible good seed candidates pushing them to
the top of the list.
• Verification of a candidate set is relatively easy
and local.
Making things less simple…
• Consider the clique number of a node v in G:
Clq(v,G), that is, the size of the largest clique in
G containing v.
• Clearly degree(v)+1 is an upper bound to Clq(v);
it is easy to compute but often not very tight.
• So we look for a better upper bound to Clq(v)
that (1) can be computed efficiently and (2) can be
easily extended to quasi-cliques.
Degeneracy and Core decomposition
• The core decomposition CD(G) assigns to each
node v in G its core number cn(v).
• cn(v)=k if k is the largest integer such that v belongs
to an induced subgraph of min degree k (the k-core).
• cn(v) is the largest integer k you can assign to v,
so that at least k neighbors w of v in G have
cn(w)>=k.
• Observation 1: cn(v)+1 is an upper bound to Clq(v),
and cn(v) <= degree(v).
• Observation 2: CD(G) can be computed in linear
time O(n+m) [Batagelj-Zaversnik, 2003].
• Thus the core number is a good candidate to replace
the degree as a convenient upper bound.
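A compact way to see the core decomposition at work is to remove a minimum-degree node repeatedly and record the largest minimum degree seen so far. The sketch below does this with a lazy min-heap, which costs O(m log n); it is not the linear-time bucket implementation of Batagelj and Zaversnik, and the function name is an assumption of this example.

```python
import heapq


def core_numbers(adj):
    """Core number of every node of an undirected graph (adjacency sets)."""
    deg = {v: len(neigh) for v, neigh in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed = set()
    cn = {}
    k = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry, the degree changed since it was pushed
        k = max(k, d)  # core numbers never decrease along the removal order
        cn[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
    return cn


# K4 on {0,1,2,3} plus a pendant node 4 attached to node 0.
toy = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0}}
toy_cn = core_numbers(toy)
print(toy_cn)
```

Node 0 has degree 4 but core number 3, so the core number gives the tighter bound on the size of the largest clique through node 0 (here 4 = cn + 1).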
Last phase: Charikar peeling+
• In [Charikar 2000] it is shown that iteratively
peeling the node of lowest degree in G, and keeping
the best intermediate subgraph, gives a
(1/2)-approximation to the densest subgraph
(measured as average degree).
• The intuition is that if we start with a graph
that is sufficiently dense, then the optimum
densest subgraph is often the same object
under either definition (average degree or density).
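The greedy peeling idea can be sketched in a few lines: remove a minimum-degree node at each step and remember the intermediate subgraph with the highest average degree. This is a plain quadratic-time illustration of the greedy scheme, not the "Peeling+" variant used in Core&Peel, whose details are not given on the slide; the function name is hypothetical.

```python
def greedy_peel(adj):
    """Return (nodes, average degree) of the best subgraph met while peeling."""
    adj = {v: set(neigh) for v, neigh in adj.items()}  # work on a copy
    m = sum(len(neigh) for neigh in adj.values()) // 2
    best_nodes = set(adj)
    best_avg = 2 * m / len(adj) if adj else 0.0
    while len(adj) > 1:
        v = min(adj, key=lambda u: len(adj[u]))  # a node of lowest degree
        m -= len(adj[v])
        for w in adj.pop(v):
            adj[w].discard(v)
        avg = 2 * m / len(adj)
        if avg > best_avg:
            best_nodes, best_avg = set(adj), avg
    return best_nodes, best_avg


# K4 on {0,1,2,3} with a path 3-4-5 hanging off it.
toy = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4},
}
nodes, avg_deg = greedy_peel(toy)
print(sorted(nodes), avg_deg)
```

On this toy graph the two path nodes are peeled first, after which the remaining K4 is optimal both for average degree and for density.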
Core&Peel in one shot
• (1) Compute the Core Decomposition of G.
• (2) Sort the vertices by decreasing core number
(break ties by |Nr(v,c(v))|).
• (3) For each node v in turn:
  • (4) Extract G[Nr(v,c(v))].
  • (5) If its density is above 50%, apply Peeling+ and stop
  when it is sufficiently dense.
• (6) Remove from the final output list any dense
subgraph completely included in another.
• (7) Remove from the final output near-duplicates,
i.e. pairs with Jaccard similarity above a threshold.
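The steps above can be strung together in a rough sketch. Several points are assumptions of this example rather than the published method: Nr(v,c(v)) is read as v plus its radius-1 neighbors whose core number is at least that of v, ties in step 2 are broken arbitrarily, Peeling+ is replaced by plain minimum-degree peeling, and all thresholds and names are hypothetical.

```python
def core_numbers(adj):
    """Core numbers by repeatedly removing a minimum-degree node (quadratic)."""
    deg = {v: len(neigh) for v, neigh in adj.items()}
    alive, cn, k = set(adj), {}, 0
    while alive:
        v = min(alive, key=deg.get)
        k = max(k, deg[v])
        cn[v] = k
        alive.remove(v)
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
    return cn


def density(adj, nodes):
    n = len(nodes)
    if n < 2:
        return 0.0
    m = sum(len(adj[v] & nodes) for v in nodes) / 2
    return 2 * m / (n * (n - 1))


def peel(adj, nodes, target):
    """Drop lowest-degree nodes until the induced density reaches target."""
    nodes = set(nodes)
    while len(nodes) > 2 and density(adj, nodes) < target:
        nodes.remove(min(nodes, key=lambda v: len(adj[v] & nodes)))
    return nodes


def jaccard(a, b):
    return len(a & b) / len(a | b)


def core_and_peel_sketch(adj, target=0.9, min_density=0.5,
                         max_jaccard=0.8, min_size=3):
    cn = core_numbers(adj)                                     # step 1
    order = sorted(adj, key=lambda v: cn[v], reverse=True)     # step 2
    found = set()
    for v in order:                                            # step 3
        cand = {v} | {w for w in adj[v] if cn[w] >= cn[v]}     # step 4 (assumed radius 1)
        if len(cand) < min_size or density(adj, cand) < min_density:
            continue
        cand = peel(adj, cand, target)                         # step 5
        if len(cand) >= min_size and density(adj, cand) >= target:
            found.add(frozenset(cand))
    kept = []
    for c in sorted(found, key=len, reverse=True):             # steps 6 and 7
        if any(c <= big or jaccard(c, big) > max_jaccard for big in kept):
            continue
        kept.append(c)
    return kept


# Two K4 cliques, {0,1,2,3} and {4,5,6,7}, joined by the edge 3-4.
toy = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}
predicted = core_and_peel_sketch(toy)
print([sorted(c) for c in predicted])
```

Because each node seeds its own candidate, the same protein can end up in several outputs, which matches the overlapping structure of real complexes; steps 6 and 7 then prune the redundant candidates.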
Experiments with PPI networks
• 6 PPI networks: DIP-yeast, Biogrid-yeast, String-yeast,
Biogrid-HS, Biogrid-HS-UBC, String-HS.
• 2 complex databases: CYC2008-yeast, Corum-HS.
• 10 competitors: MCL, COACH, MCODE, MCL-Caw,
CMC, ProRank, Spici, RNSC, ClusterOne, CFinder.
• 1 evaluation metric: F-measure from [Li et al. BMC
Genomics, 2010, 11:S3], for ω=0.2.
• 3 evaluation metrics: J-measure, PR-measure,
SS-measure from [Song et al., Bioinformatics, 2009].
• Build an Aggregated Quality Index by summing them.
• Measure GO enrichment by BH-corrected p-values in
hypergeometric tests (q-values).
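For the last bullet, a minimal sketch of a hypergeometric enrichment p-value followed by a Benjamini-Hochberg correction is shown below. It uses only the standard library; the function names and the toy numbers are illustrative, and the exact GO pipeline used in the experiments is not described on the slide.

```python
from math import comb


def hypergeom_pvalue(population, annotated, cluster_size, hits):
    """P(X >= hits) when drawing cluster_size proteins without replacement
    from a population in which `annotated` proteins carry the GO term."""
    upper = min(cluster_size, annotated)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, cluster_size)


def benjamini_hochberg(pvalues):
    """BH-adjusted p-values (q-values), returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy numbers: 10 proteins, 4 annotated, a cluster of 3 with 2 annotated.
p_toy = hypergeom_pvalue(population=10, annotated=4, cluster_size=3, hits=2)
q_toy = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_toy, q_toy)
```

A predicted cluster would then count as enriched at a given threshold when its best q-value over the tested GO terms falls below that threshold.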
Testing Overview
[Diagram: the PPI network is fed to Prediction-Alg1, Prediction-Alg2, …, Prediction-Algn; each prediction is passed to an Evaluator, which uses Validated Complexes and Gene Ontology to produce scores.]
Objective 1: pick the champion for the problem
Objective 2: hedge the strengths and weaknesses of several methods
F-measure - yeast
[Figure: bar chart of the F-measure (y-axis 0 to 0.6) on DIP-y, BG-y and STR-y for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2 and Rand3.]
Semantic Similarity - yeast
[Figure: bar chart of the semantic similarity (y-axis 0 to 0.4) on DIP-y, BG-y and STR-y for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2 and Rand3.]
Aggregated Score - yeast
[Figure: bar chart of the aggregated score (y-axis 0 to 1.8) on DIP-y, BG-y and STR-y for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2 and Rand3.]
F-measure - hs
[Figure: bar chart of the F-measure (y-axis 0 to 0.7) on BG-hs, BG-hs-UBC and STR-hs for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2 and Rand3.]
Semantic Similarity - hs
[Figure: bar chart of the semantic similarity (y-axis 0 to 0.5) on BG-hs, BG-hs-UBC and STR-hs for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder, Rand1, Rand2 and Rand3.]
Aggregated Score - hs
[Figure: bar chart of the aggregated score (y-axis 0 to 1.8) on BG-hs, BG-hs-UBC and STR-hs for C&P, MCL, COACH, MCODE, CMC, MCL-CAW, ProRank, Spici, ClusterOne, RNSC, Cfinder and Rand1.]
DIP-y (hypergeometric q-values)
[Figure: chart with q-value thresholds exp(-2) to exp(-7) on the x-axis and counts from 0 to 2500 on the y-axis, for C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC and Cfinder.]
Bg-y (hypergeometric q-values)
[Figure: chart with q-value thresholds exp(-2) to exp(-7) on the x-axis and counts from 0 to 3500 on the y-axis, for C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC and Cfinder.]
Str-y (hypergeometric q-values)
[Figure: chart with q-value thresholds exp(-2) to exp(-7) on the x-axis and counts from 0 to 3500 on the y-axis, for C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC and Cfinder.]
Bg-hs (hypergeometric q-values)
[Figure: chart with q-value thresholds exp(-2) to exp(-7) on the x-axis and counts from 0 to 7000 on the y-axis, for C&P, MCL, Coach, Mcode, CMC, PRPlus, Spici, Clusterone, RNSC and Cfinder.]
Bg-hs-UBC (hypergeometric q-values)
[Figure: chart with q-value thresholds exp(-2) to exp(-7) on the x-axis and counts from 0 to 7000 on the y-axis, for C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC and Cfinder.]
Str-hs (hypergeometric q-values)
[Figure: chart with q-value thresholds exp(-2) to exp(-7) on the x-axis and counts from 0 to 4500 on the y-axis, for C&P, MCL, Coach, Mcode, CMC, MCLCAW, PRPlus, Spici, Clusterone, RNSC and Cfinder.]
Time (in seconds) in Log10 scale: str-hs
[Figure: bar chart of running times on STR-hs, y-axis from -0.5 to 4 (log10 of seconds).]
Runs optimizing the F-measure for each algorithm
Time (in seconds) in Log10 scale: bg-hs-UBC
[Figure: bar chart of running times on BG-Hs-UBC, y-axis from -0.5 to 4.5 (log10 of seconds).]
Runs optimizing the F-measure for each algorithm
Weighted vs unweighted PPI Networks
T. Nepusz, H. Yu, and A. Paccanaro. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods, vol. 9, pp. 471-472, 2012.
PPIN and Cancer studies
• Objective 1: Identification of cancer modules.
• Objective 2: Identification of candidate drug targets (bridges between
modules, bottlenecks, inter-modular hubs).
• Dynamic modularity in protein interaction networks predicts breast cancer outcome. Taylor
IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL.
Nat Biotechnol. 2009 Feb;27(2):199-204.
• A network module-based method for identifying cancer prognostic signatures. Guanming Wu
and Lincoln Stein, Genome Biology 2012 13:R112
• Disease biomarker identification from gene network modules for metastasized breast cancer.
Pooja Sharma, Dhruba K. Bhattacharyya, and Jugal Kalita, Sci Rep. 2017; 7: 1072.
• The OncoPPi network of cancer-focused protein–protein interactions to inform biological
insights and therapeutic strategies. Zenggang Li et al. Nature Communications, volume 8,
Article number: 14356 (2017)
• Disease Module Identification DREAM Challenge. bioRxiv 265553 (2018). doi:
https://doi.org/10.1101/265553
Conclusions/Perspectives
• Other types of molecular networks (gene
co-expression networks, signaling networks,
metabolic networks).
• Tissue-specific and disease-specific
information.
• Evaluation/ranking of clusters/communities
produced by ‘in silico’ methods.
• Disease modules vs cancer modules
Conclusions/future work
• Compare it to more tools…
• Use Inter-species PPI conservation as an
additional ranking criterion.
• Introduce co-expression data (tissue/process
specificity).
• Identify classes of complexes differentially
predicted by different tools.
• Extension to “functional modules” prediction
• Objective: study the “interactome” using large
PPI networks with efficient algorithms.
